
University of Eastern Finland School of Computing

Bird song synthesis using neural vocoders

Rhythm Rajiv Bhatia (308847)


Abstract

UNIVERSITY OF EASTERN FINLAND, Faculty of Science and Forestry, Joensuu School of Computing

Computer Science

Student: Rhythm Rajiv Bhatia, Master's Thesis, 30 p.

Supervisor of the Master's Thesis: Prof. Tomi H. Kinnunen, PhD. June 2021

Keywords: neural network, vocoder, analysis-resynthesis, bird audio

Abstract: This thesis presents a study on bird song synthesis with deep learning applied to bird vocalizations. It covers traditional methods for audio generation, such as speech synthesis and voice conversion systems, the use of voice activity detection, and other factors that are important for a thorough understanding and use of the concept. The increased versatility of modern vocoders has expanded their application beyond speech to other domains (such as music). The dataset used here is xccoverbl; it contains 88 single-labelled species and 264 audio files with a total duration of 4.9 hours. Out of these 88 species, 10 were chosen for testing purposes.

This work discusses the estimation performance on a variety of common bird vocalizations based on factors such as pitch, fundamental frequency and vocalization type. The simultaneous presence of multiple tones or partials in bird vocalizations, frequency sweeps extending over several octaves, and very rapid pitch modulations are some of the difficulties encountered in the generation of bird song. Nevertheless, this thesis presents carefully implemented neural waveform modelling for bird audio synthesis. The traditional approach to TTS and voice conversion uses fixed (signal processing based) operations to represent speech waveforms using a small number of parameters, such as spectral envelope, fundamental frequency and aperiodicity. The author has compared vocoder models through analysis-resynthesis experiments, including objective and subjective evaluation. This work provides initial analysis-resynthesis experiments with a traditional vocoder (WORLD) and two neural vocoders (WaveNet autoencoder, Parallel WaveGAN) in the context of bird vocalizations. The initial subjective results indicate no difference between the three vocoders in terms of species identification (ABX test). Nonetheless, the WORLD vocoder samples were rated higher in terms of retaining bird-like qualities (MOS test). The overall MOS score (out of 5) is 4.12 for the WORLD vocoder, 3.49 for the WaveNet autoencoder, 3.59 for Parallel WaveGAN, and 4.44 for natural audio.

In the ABX test, the accuracy (%) is 71.17 for the WORLD vocoder, 71.76 for the WaveNet autoencoder, 70.58 for Parallel WaveGAN, and 73.53 for natural audio.

Foreword

This work is partially sponsored by the Academy of Finland. The author would like to express her gratitude to the listeners who took part in the perceptual experiments.

I would like to express my gratitude to my supervisor Prof. Tomi H. Kinnunen for his constant guidance and support.

The author would also like to thank Dr. Rosa González Hautamäki for her valuable inputs concerning the listening setup.

I am thankful to the staff of University of Eastern Finland who made it possible to work online with so much ease during this difficult time.

I am also grateful for the support received from my family and friends, especially Tushar, Seema, Deepak, Abhishek, Sushma and Rajiv.

Thank you all for your consistent guidance, patience and support.


Abbreviations

TTS Text-to-Speech

HMM Hidden Markov Model

GMM Gaussian Mixture Model

AUD Acoustic Unit Discovery

FFT Fast Fourier Transform

VAE Variational Autoencoder

VQ Vector Quantization

VQ-VAE Vector Quantized Variational Autoencoder

GAN Generative Adversarial Network

NN Neural Network

CNN Convolutional Neural Network

RAM Random Access Memory

PSU Power Supply Unit

ASR Automatic Speech Recognition

Contents

1 Introduction
1.1 Motivation and Problem Statement
1.2 Properties of Bird song
1.3 Research Hypothesis
1.4 Thesis Structure

2 Traditional Methods for audio generation
2.1 STRAIGHT Vocoder
2.2 WORLD Vocoder
2.2.1 DIO: F0 Estimation Algorithm
2.2.2 CheapTrick: Spectral Envelope Estimation Algorithm
2.2.3 PLATINUM Algorithm
2.2.4 Synthesis Algorithm

3 Background on Neural Networks
3.1 Neural Networks and bird audio synthesis
3.2 Feature Extraction
3.3 Generative Adversarial Networks
3.4 Autoencoders

4 Text-to-Speech systems
4.1 Traditional TTS
4.2 TTS without T

5 Methods selected for experiments
5.1 Tacotron2: Wavenet vocoder
5.2 Wavenet Autoencoder
5.3 Parallel wavegan

6 Experimental setup
6.1 Dataset
6.2 Data Representation Technique
6.3 Models
6.3.1 Tacotron2: WaveNet Vocoder
6.3.2 WaveNet Autoencoder
6.3.3 Parallel wavegan
6.4 Evaluation Metrics
6.4.1 Objective Evaluation
6.4.2 Subjective Evaluation

7 Experimental Results
7.1 Objective Evaluation Results
7.2 Subjective Evaluation Results
7.2.1 Subjective results: species discrimination (ABX)
7.2.2 Subjective results: bird-related cues (MOS)

8 Discussion
8.1 Exploring MOS and RMSE results
8.2 Exploring ABX results
8.3 Subjective vs objective results cross-correlation
8.4 Listener reported feedback

9 Conclusion

10 Appendix

1. Introduction

Speech synthesis [94] is the artificial production of speech. It refers to techniques for generating speech artificially using electronic hardware or software. Traditional strategies for speech synthesis are mainly of two types: concatenative and parametric. The concatenative approach utilizes audio from a large database to generate new audio. If different or new audio is required, a new database with a large number of audio files is needed again, which limits the scalability of the concatenative approach. The parametric approach, in turn, needs recorded audio and a function with a set of parameters that can be modified to change the audio as required. Both of these approaches were predominant in the past for speech synthesis.

There are new ways to perform audio synthesis using deep learning methodology [41]. Some of the most commonly addressed modern approaches in speech synthesis include the following:

1. WaveNet: A Generative Model for Raw Audio [69]

2. Tacotron: Towards end-to-end Speech Synthesis [90]

3. Natural TTS Synthesis via WaveNet by conditioning on mel spectrogram predictions [79]

4. MelGAN: using mel spectrograms to generate audio with GANs [50]

5. Parallel WaveNet: Fast High-Fidelity Speech Synthesis [92]

These techniques will be discussed in later chapters.

1.1 Motivation and Problem Statement

Birdsong as a research subject requires exhaustive study. The richness and diversity of birdsong has inspired basic research on its communicative function. Moreover, acoustic monitoring of bird populations is considered an attractive and desirable option for assessing biodiversity and promoting nature conservation activities.

This is also because birds are more easily detectable through acoustic rather than visual cues. Birds interact mainly by vocalizing. The bird calls that humans often hear serve to warn of approaching danger, to identify a specific individual, to mark territory, and many other purposes [84].

Generation of bird vocalizations could support such investigations, not only to understand the nature of the birds themselves but also to capture various biotic and abiotic features of their environment, which is not possible by simply recording the spatial position or movement of individuals. This information is used for various research purposes such as species conservation, biodiversity monitoring and studying the impact of anthropogenic noise, although analyzing the soundscapes and signals still remains a challenge [62].

Humans have been successful in the generation of complex sequential behaviors such as speech and music. Sequences of speech and music are built from action sequences performed according to complex sequencing rules.

Besides ecology, bird sounds have also raised the curiosity of machine learning and signal processing scholars.

This has encompassed the tasks of species recognition [81] and locating bird segments against background noise (analogous to speech activity detection) [82]. Acoustic wildlife data (indeed, true in-the-wild data) is often collected from a distance and tends to have a low signal-to-noise ratio (SNR), includes overlapping sound events and lacks labels, to name some of the technical challenges. Nevertheless, the accessibility and availability of enormous public annotated bird sound data (such as the Xeno-Canto collection [91]) has enabled research into machine learning for bird sounds.

While a number of recognition, segmentation and labeling approaches have been proposed, generation of bird sounds seems to have received less attention. There are, however, many potential applications ranging from games, movies and virtual reality to education and robotics where flexible generation of bird sounds (more generally, any animal vocalizations) can be useful. There have been efforts on physical-model-based synthesis of general mammal sounds [58] along with adaptation of human pitch tracking to birdsong [65], to name a few. Besides physics and signal processing directed approaches, there has also been work that draws inspiration from text-to-speech (TTS) methods [12, 34].

Such 'bird TTS' studies, however, are in the minority, even though birdsong and speech both serve a communicative function and consist of structured sequential data. One reason why birdsong synthesis has received so much less attention than recognition methods could be that the task itself is not precisely defined. It is useful to contrast the similarities and differences of birdsong and another extensively studied complex bio-acoustic signal, human speech. Both serve a communicative function; just as humans communicate verbal (and non-verbal) signals, birds communicate messages related to food, mating, territorial defense, and danger, to name a few. On the acoustic level, both speech and birdsong are structured signals composed of complex sequences of elementary units such as syllables and phrases. The apparent difference between the two, however, is that 'bird language' lacks a commonly agreed, standard written form (orthography). Human language exists in both written and spoken forms, and the statistical association between these two enables tasks such as TTS or automatic speech recognition. In the limited number of bird TTS studies, the issue has been addressed through automatically derived acoustic units learnt with unsupervised techniques, each associated with an arbitrary discrete symbol (e.g. an integer). It is worthwhile to note that similar unsupervised acoustic unit discovery approaches have recently been addressed by the speech community [25, 87, 63].

While birdsong generation through statistical parametric speech synthesis has been addressed in the past, this thesis addresses bird vocalization generation from the perspective of neural waveform modeling. The traditional approach to TTS and voice conversion uses fixed (signal processing based) operations to represent speech waveforms via a small number of parameters, such as spectral envelope, fundamental frequency and aperiodicity. A major breakthrough came in 2016 with the introduction of WaveNet [70], an approach to model raw waveform samples directly. WaveNet and numerous subsequent neural waveform modeling approaches have shifted the focus of research in TTS. While past synthetic speech suffered from many artifacts, speech produced by neural waveform models can already be indistinguishable from real human speech for listeners (e.g. [89]) and in many cases for automatic methods as well.

As the neural paradigm models raw waveform directly, these models have provided excellent results in modelling other acoustic signals beyond speech — such as music. On the other hand, unlike traditional vocoders, the neural models require (often time-consuming) training. Similar to any machine learning task, they can also be sensitive to the choices of training data, architectures and control parameters. As far as the author is aware, this work is the first to address birdsong generation using neural waveform models. Given the less-than-ideal characteristics of bird sounds and different spectral and temporal structure of human speech and birdsong, the question addressed here is whether neural vocoders are a direction worth looking into in the task of generating bird vocalizations. Bird vocalizations tend to have rapid F0 fluctuations, a complex temporal structure, but lack harmonicity (a key property of human speech and music).

The focus of this work is purposefully limited to the vocoder part only. The vocoder is a critical component of more complete synthesizers. Given the limited work in this new domain, the author of this thesis feels this focus is well justified. To this end, the author has chosen three modern vocoders: WORLD [59], WaveNet Autoencoder [28] and Parallel WaveGAN [92].

Generation of bird vocalizations also makes it possible to capture features of the birds' environment, which is not currently feasible by recording the spatial position or movements of individual birds. It is also useful for conservation research, monitoring biodiversity and studying the effects of anthropogenic noise [80] [35].

1.2 Properties of Bird song

One of the main functions of bird song is mate attraction. Scientists have hypothesized that the evolution of bird song happens via sexual selection. The quality of the bird song is an indicator of the fitness of the individual [95].

Territory defense is another important function of bird vocalization. Territorial birds use songs to negotiate boundaries. Since songs indicate quality and strength, individuals use them to assess their rivals' quality and to decide whether to take up a fight or not. Birds use song repertoires for complex communication among individuals [31] [40] [56]. Songs are arranged into several phrases, which consist of series of syllables.

The avian vocal organ is known as the syrinx. The syrinx is a bony structure located at the bottom of the trachea [64]. The bird forces air through the membranes in the syrinx, and sometimes through the surrounding air sac, which resonates the sound waves. The bird controls volume and pitch by changing the force applied during exhalation. Birds independently control both sides of the trachea; therefore, some bird species can produce two notes at once [64]. A schematic visualization of the bird vocal system is shown in Figure 1.1. The author of this thesis has used the Praat software to visualize the birdsong waveform files in Figures 1.2 and 1.3. Praat is a software tool used for audio analysis, specifically intensity, pitch, formants and annotation, and for recording stereo or mono sounds [11].

Figure 1.1: Schematic diagram of bird vocal system adopted from [14]

1.3 Research Hypothesis

The author of this thesis addresses the following questions through this work:

1. Is it possible to synthesize birdsongs in a way similar to how human speech is synthesized? Will this generate meaningful samples?

2. Should a traditional methodology be used, or a modern approach based on neural waveform modelling or a generative model?

3. Will listeners be able to discriminate between the different species in the generated bird audio?

4. Do all the models have similar performance for the same set of bird species?

1.4 Thesis Structure


Figure 1.2: Waveform and spectrogram of a Great Spotted Woodpecker’s birdsong using Praat [10]

The second chapter describes traditional methods for audio generation. The third chapter provides an overview of the background of neural networks and their use in audio synthesis. The fourth chapter describes text-to-speech systems, the fifth presents the methods used, and the sixth and seventh chapters describe the experimental setup and results. The eighth chapter provides a discussion of this work, and the ninth chapter concludes and outlines future work in this field.


Figure 1.3: Waveform and spectrogram of a Common Reed Bunting birdsong using Praat [10]


2. Traditional Methods for audio generation

Speech generation or synthesis systems are utilized for different purposes such as singing voice synthesizers [47] or voice conversion [66]. Speech manipulation, synthesis and analysis are based on the idea of the vocoder. A vocoder analyzes and synthesizes human speech. It is called a vocoder because it includes an encoding step, i.e. speech analysis, and reconstruction of the speech, also known as speech synthesis. Vocoders are used to compress audio, encrypt voices, multiplex and/or transform voice. The vocoder was invented in 1938 by Homer Dudley at Bell Labs for synthesizing the human voice [24]. This work resulted in the development of the channel vocoders used in telecommunications voice codecs. The vocoder has also been used extensively for music synthesis [22].

A vocoder operates on features such as the fundamental frequency (F0) and spectral envelope, and uses a synthesis algorithm driven by the estimated speech parameters [24]. Although speech synthesized using conventional vocoder systems is observed to be inferior to speech synthesized using waveform-based systems [9], it is important to go through the traditional methodology of speech generation in order to understand the advantage of using deep learning techniques for speech generation.

2.1 STRAIGHT Vocoder

Usually the speech synthesis performed by traditional vocoders is inferior to modern waveform-based systems; an exception is the STRAIGHT vocoder, which is a non-waveform-based vocoder.

Repeated excitation of a resonator improves the signal-to-noise ratio for transmitting the resonant information (for example, in a nanoscale mechanical resonator). However, this repetition results in periodic interference in the time and frequency domains. In order to recover the underlying information in the time and frequency domains, the following two-step procedure was introduced [44].

1. The first step is to extract power spectra that minimize temporal variation using a complementary set of time windows.

2. The second step is inverse filtering in a spline space to remove frequency-domain periodicity while preserving the original spectral levels at harmonic frequencies.

STRAIGHT is used for the manipulation of speech features such as voice quality, speed, pitch, frequency, timbre and other attributes. The tool is continuously being improved to attain quality close to original natural speech.

STRAIGHT decomposes speech into two parts, namely source and resonator (filter) information. This decomposition makes it extremely simple to conduct experiments on speech, to meet the initial design objective, and to interpret experimental results [44].

STRAIGHT is basically a channel vocoder. A channel vocoder represents a speech waveform by compressing and encoding the data in such a manner that it retains the intelligibility of the waveform. However, the design objective of the STRAIGHT vocoder differs greatly from its predecessors, such as pitch- and F0-based speech representation and restructuring in speech synthesis research. High-quality synthetic speech by STRAIGHT provides a representation consistent with the perception of sounds [44].

Surface reconstruction: This procedure is the repeated excitation of a resonator to refine the signal-to-noise ratio (SNR) during transmission of the resonant information. Some periodic interference in both the time and frequency domains is introduced by this repetition. Due to this interference, it is important to reconstruct the underlying smooth time-frequency surface [44].

Fundamental Frequency Extraction:

The fundamental frequency (F0) is the inverse of the fundamental period of a periodic signal. It is calculated by mapping a fixed point in frequency to an instantaneous frequency of a short-term Fourier transform. The surface reconstruction process described above is heavily dependent on F0. It is also observed that minor errors in F0 trajectories affect synthesized speech quality, which motivated the development of dedicated F0 extractors [44].

2.2 WORLD Vocoder

WORLD is another traditional vocoder, similar to STRAIGHT. It produces high-quality output and provides a way to decompose the speech signal into fundamental frequency (F0), spectral envelope and aperiodicity. The WORLD vocoder synthesizes speech with good quality and quick processing [59]. The basic building blocks of WORLD are DIO, CheapTrick and PLATINUM.

In the following subsections the author of this thesis provides a brief review of each method. An overview of the WORLD vocoder is given in Figure 2.1.

2.2.1 DIO: F0 Estimation Algorithm

There are numerous different fundamental frequency (F0) trackers that operate in different signal domains. The two primary domains utilize time-domain characteristics (e.g. autocorrelation) and spectral characteristics (e.g. cepstrum).

WORLD uses the DIO F0 estimation algorithm. It is faster than the YIN [19] and SWIPE [17] algorithms, with equally good estimation performance. According to [59], the DIO algorithm is fast and reliable.

The DIO algorithm has three main steps [59]:

1. Low-pass filtering with different cutoff frequencies: if the filtered signal contains only the fundamental component, it forms a sine wave with period T, the fundamental period. Since the target fundamental frequency is unknown, several filters with different cutoff frequencies are used.

2. Fundamental frequency candidates and their reliability: a signal consisting of the fundamental component only forms a sine wave. Four intervals of the waveform are considered: the positive zero-crossing interval, the negative zero-crossing interval, the peak interval and the dip interval.

3. Selection: the most reliable candidate is selected.

2.2.2 CheapTrick: Spectral Envelope Estimation Algorithm

The essential parameter for human speech processing is the spectral envelope [59]. Cepstrum coding and linear predictive coding (LPC) are typical algorithms used for modeling the short-term spectral envelope [59]. The authors of the CheapTrick algorithm argue that many algorithms similar to CheapTrick have been developed but were not successful in human speech synthesis [59]. The issue is that the estimation result is highly dependent on the temporal position of the window [24]. Therefore, it is important to use the time-varying component and maintain the estimation accuracy [24]. To meet such high-quality requirements, Legacy-STRAIGHT [45] and TANDEM-STRAIGHT [46] were developed.

Figure 2.1: Block diagram of WORLD vocoder inspired from [59]

2.2.3 PLATINUM Algorithm

PLATINUM is an aperiodic parameter extraction algorithm. In this algorithm, mixed excitation and aperiodicity are used for natural speech synthesis. The Legacy-STRAIGHT [45] and TANDEM-STRAIGHT [46] algorithms utilize an aperiodicity speech parameter for the synthesis of periodic and aperiodic signals. WORLD follows the PLATINUM approach: whereas Legacy-STRAIGHT and TANDEM-STRAIGHT utilize aperiodicity, WORLD calculates the excitation signal directly from the fundamental frequency, spectral envelope and the waveform [59].

2.2.4 Synthesis Algorithm

The Legacy-STRAIGHT [43] and TANDEM-STRAIGHT [46] algorithms calculate each vocal cord vibration independently using the periodic and aperiodic responses. TANDEM-STRAIGHT directly utilizes periodic responses [46], whereas the Legacy-STRAIGHT algorithm aims at avoiding a buzzy timbre [43] and therefore manipulates the group delay. The WORLD vocoder calculates the vibration of the vocal cords based on the response of the extracted signal and its minimum-phase convolution [59]. The computational cost of the WORLD vocoder is lower because it has fewer convolutions in comparison to Legacy-STRAIGHT or TANDEM-STRAIGHT [59].

The fundamental frequency is used to determine the temporal positions, i.e. the origins of the vocal cord vibrations. In the case of Legacy-STRAIGHT and TANDEM-STRAIGHT, the spectral envelope and aperiodicity are used to determine the excitation signal [43] [46], which is calculated using the flattened spectral envelope. Therefore there are differences in how these waveforms are synthesized. Overall, the waveform synthesized by WORLD appears to be closer to the input waveform [59]. The synthesis code used for this vocoder can be found in the appendix (Listing 10).
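As a concrete illustration of the analysis-resynthesis pipeline described in this chapter, the following is a minimal sketch using pyworld, a Python binding of the WORLD vocoder. The file name is an illustrative assumption, and note that this binding exposes the D4C algorithm rather than PLATINUM for aperiodicity estimation; this is not the exact code used in the thesis.

```python
# Minimal WORLD analysis-resynthesis sketch with the pyworld binding.
# Assumes a mono recording; pyworld expects float64 samples, which
# soundfile returns by default.
import soundfile as sf
import pyworld as pw

x, fs = sf.read("bird.wav")            # placeholder input file

f0, t = pw.dio(x, fs)                  # DIO: coarse F0 estimation
f0 = pw.stonemask(x, f0, t, fs)        # refine the F0 trajectory
sp = pw.cheaptrick(x, f0, t, fs)       # CheapTrick: spectral envelope
ap = pw.d4c(x, f0, t, fs)              # aperiodicity (D4C in this binding)

y = pw.synthesize(f0, sp, ap, fs)      # resynthesize the waveform
sf.write("bird_world_resynth.wav", y, fs)
```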


3. Background on Neural Networks

A neural network is a set of algorithms used to analyze a set of data and its relationships using a process that imitates the operation of the human brain. Neural networks thus replicate the structure of neurons and can be organic or artificial in nature [2]. In a neural network, software learns to perform the required tasks by analyzing a given dataset known as the training samples. This dataset contains examples, such as audio files or pictures, and a label for each file that defines what it is. For example, a bird species recognition neural network might have an audio file named xc121.wav along with its corresponding species label. The system finds similarities among audio files with the same labels and dissimilarities among audio files with different labels, and learns these patterns. These patterns are known as features. For example, an object recognition system that identifies pictures would have thousands of labelled images, for example images of schools, houses, traffic signals, cars, trucks and parks. The system would learn to correlate the visual patterns of the images with their particular labels [73].

A neural net comprises potentially millions of densely interconnected, simple processing nodes. Many neural networks are arranged in layers of nodes in a feed-forward fashion, meaning the data moves in one direction. An individual node may be connected to several nodes in the layer below it (from which it receives data) and several nodes in the layer above it (to which it sends data). Each of a node's incoming connections is assigned a number known as a weight. When the network is active, the node receives a different number over each of its connections and multiplies it by the associated weight. The resulting products are then summed to yield a single number. If this number is less than a threshold value, no data is passed to the next layer.

If this number is greater than the threshold value, the node fires, i.e. it sends the sum of the weighted inputs along all its outgoing connections. While training the neural network, the weights and thresholds are continually adjusted until training data with the same labels consistently yield similar outputs.
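The toy sketch below illustrates the weighted-sum-and-threshold behaviour of a single node described above. The numbers are arbitrary, and real networks use differentiable activation functions rather than a hard threshold.

```python
# Toy sketch of one node: weighted sum of inputs compared against a threshold.
import numpy as np

def node_output(inputs, weights, threshold):
    total = np.dot(inputs, weights)              # inputs times connection weights, summed
    return total if total > threshold else 0.0   # the node "fires" only above the threshold

x = np.array([0.2, 0.7, 0.1])                    # data arriving over three connections
w = np.array([0.5, -0.3, 0.8])                   # weights of those connections
print(node_output(x, w, threshold=0.0))          # 0.1 - 0.21 + 0.08 = -0.03, below threshold -> 0.0
```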

The difference between shallow and deep learning is further explained with the help of Table 3.1 and illustrated in Figure 3.1.

Table 3.1: Difference between Deep Learning and Shallow Learning [3]

Definition: deep learning performs transformation and extraction of features and forms a relation between stimuli and neural response; shallow learning uses neurons to transmit data and output values through connections.

Feature extraction: deep learning performs it within the network; shallow learning does not perform it within the network.

Architecture: deep learning uses CNNs, unsupervised pre-trained, recurrent and recursive NNs; shallow learning uses feed-forward, symmetrically connected NNs.


Figure 3.1: Block diagram of (a) shallow learning and (b) deep learning, inspired by [3]. In shallow learning the feature extraction learning variables do not depend on the data, whereas in deep learning the values of the feature extraction learning variables might change with the data.

3.1 Neural Networks and bird audio synthesis

Birdsongs synthesized using neural networks are more refined than those produced by the vocoders described in the previous chapter. Neural networks help to generate birdsongs with varied amplitudes, since birdsongs vary wildly and sonically across species [15]. The birdsong problem can be divided into two main research areas: information retrieval, which aims to design models capable of recognizing the semantics present in birdsong waveform signals [7]; and algorithmic composition, where the goal is to generate new birdsongs computationally [55].

3.2 Feature Extraction

For a successful network it is important that meaningful input features are generated, because the vocoders used in the experimental part are conditioned on external acoustic features. To ensure meaningful input features for bird audio generation, the author of this thesis first discriminates between singing/calling (signal) and noise or silence (noise) within each audio file. The noises typically observed in the birdsong dataset are flowing water, groups of birds chirping, animal noises, humans talking or traffic. This procedure is known as bird activity detection [67] and is illustrated in Figure 3.2.

A threshold is set manually by inspecting the audio and its energy plot. For instance, the threshold in the example below is 160 dB: all frames with energy equal to or higher than 160 dB are classified as bird call, and all frames with energy below 160 dB are classified as noise. This is illustrated in Figure 3.2. Then the spectrogram is computed for both the bird call audio and the noise audio to use as training features. A spectrogram is a visual representation of the variation in signal energy across time and frequency.

The author of this thesis has divided the audio waveform into equally sized chunks, i.e. frames, and then computed the fast Fourier transform (FFT) across the frequency axis over these frames. Using an auditory frequency scale emphasizes the details in lower frequencies while de-emphasizing high-frequency details, which are dominated by fricatives and other noise bursts and generally do not need to be modeled with high fidelity [79]. Each of these chunks is used as a unique bird call spectrogram sample for training/testing in the neural network. This scale is illustrated in Figure 3.3 and the feature generation procedure is summarised in Figure 3.4.
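A minimal sketch of this energy-based bird activity detection and feature generation, assuming the librosa library, is given below. The file name, frame sizes and threshold rule are illustrative stand-ins for the manually chosen values described above, not the thesis implementation itself.

```python
# Sketch of frame-level bird activity detection followed by (mel) spectrogram features.
import numpy as np
import librosa

y, sr = librosa.load("xc_recording.wav", sr=None)   # placeholder recording
frame_len, hop = 2048, 512

# Frame-level log energy of the waveform.
frames = librosa.util.frame(y, frame_length=frame_len, hop_length=hop)
energy_db = 10 * np.log10(np.sum(frames ** 2, axis=0) + 1e-10)

# The thesis sets the threshold manually per recording; a data-driven stand-in is used here.
threshold_db = energy_db.mean() + 6.0
active = energy_db >= threshold_db                   # "bird active" frames

# Concatenate the active frames and compute spectrogram features for training.
bird_audio = frames[:, active].T.reshape(-1) if active.any() else y
spec = np.abs(librosa.stft(bird_audio, n_fft=frame_len, hop_length=hop))
mel = librosa.feature.melspectrogram(S=spec ** 2, sr=sr)   # mel-scale features (cf. Figure 3.3)
```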

[Figure 3.2 panels: original spectrogram (Hz vs. seconds), frame-level energy plot (dB), energy plot of bird-active frames, and birdcall spectrogram.]

Figure 3.2: Representation of the bird activity detection plot. Here, frame-level bird call detection is performed on the energy plot to identify the "bird active" regions in the given audio file.

[Figure 3.3 panel: mel filterbank gains (filter gain vs. DFT bin).]

Figure 3.3: An example of a mel-frequency filterbank. The mel scale relates tone, pitch and frequency as perceived by the human ear to the actual frequency, which helps to incorporate features closer to the way a human ear would perceive them. This graph was generated using the librosa library [53].

Figure 3.4: Feature generation


3.3 Generative Adversarial Networks

Generative adversarial networks (GANs) use two neural networks that compete with one another to generate new data [32]. In this thesis, a GAN called Parallel WaveGAN is used to generate birdsongs (see subsection 6.3.3). The generative network produces the samples and the discriminative network evaluates them. The generative network learns to map samples from a latent space, while the discriminative network learns to distinguish between samples produced by the generator and original samples. The objective of training the generative network is to increase the error rate of the discriminative network. Training involves presenting samples from the training dataset until sufficient accuracy is achieved. The generator is trained based on whether it succeeds in fooling the discriminator or not. Specifically, the generator is seeded with randomized input sampled from a predefined latent space, and the candidates synthesized by the generator are then evaluated by the discriminator. Independent backpropagation procedures are performed on both networks, so that the generator produces better samples while the discriminator becomes more skilled at flagging synthetic samples. GANs often suffer from so-called mode collapse, where they fail to generalize properly and miss entire modes of the input data. For example, a GAN trained on the MNIST dataset of handwritten digits (the Modified National Institute of Standards and Technology database, a large collection of handwritten digit samples [20]) might omit a subset of the digits from its output. Some researchers feel the root problem is a weak discriminative network that is not able to identify the pattern of omission, while others feel that the reason is a bad choice of objective function. Some solutions have been suggested [32].
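To make the generator/discriminator interplay described above concrete, here is a minimal adversarial training step in PyTorch. The fully connected networks, dimensions and losses are illustrative assumptions and are far simpler than the Parallel WaveGAN used later in this thesis.

```python
# Minimal sketch of one GAN training step (not the Parallel WaveGAN architecture).
import torch
import torch.nn as nn

latent_dim, data_dim = 64, 128
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real_batch):
    n = real_batch.size(0)
    # Discriminator: label real samples 1 and generated samples 0.
    fake = G(torch.randn(n, latent_dim)).detach()
    loss_d = bce(D(real_batch), torch.ones(n, 1)) + bce(D(fake), torch.zeros(n, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # Generator: try to make the discriminator label its samples as real.
    loss_g = bce(D(G(torch.randn(n, latent_dim))), torch.ones(n, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```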

3.4 Autoencoders

Autoencoders use a network that encodes the input and extracts a latent representation. This encoded information is then passed through a decoding network to recover the original data. In this thesis, a WaveNet autoencoder, i.e. a variant of the autoencoder, is used to generate birdsongs (more information can be found in subsection 5.2). Ideally, this latent representation of the original data preserves the salient features learnt by the network, and the learnt feature representation provides disentanglement of the variations of interest from noise. These desirable qualities are typically obtained through a judicious application of regularization techniques and constraints or bottlenecks. The representation learned by an autoencoder is thus subject to two competing forces. On one hand, the network should provide the decoder with the information required for perfect reconstruction and should capture within the latent as many characteristics of the input data as possible. On the other hand, the constraints force some of this information to be discarded, preventing the latent representation from being trivial to invert, for instance by exactly passing the input through [51] [30].
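The sketch below illustrates the encoder-bottleneck-decoder structure described above, assuming flattened spectrogram frames as input. It is far simpler than the WaveNet autoencoder adopted in Chapter 5, and all dimensions are illustrative.

```python
# Minimal autoencoder sketch: encode to a small latent code, decode back.
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=513, latent_dim=16):
        super().__init__()
        # Encoder compresses the input into a bottleneck latent representation.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
        # Decoder attempts to reconstruct the original input from that code.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, input_dim))

    def forward(self, x):
        z = self.encoder(x)           # latent representation
        return self.decoder(z), z     # reconstruction and code

# Training minimises a reconstruction loss, e.g. nn.MSELoss(), between the
# input frame and its reconstruction.
```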


4. Text-to-Speech systems

Text-to-Speech (TTS) systems, as the name suggests, take text as input on a computer or other digital device and convert it into a speech waveform, as represented in Figure 4.1. TTS is used as an assistive technology to read digital text aloud.

Some of the use cases of TTS include air/travel information systems, news reading, storytelling and desktop assistants.

Figure 4.1: Flowchart for Text to Speech system

4.1 Traditional TTS

There are two main types of TTS systems, namely parametric TTS and concatenative TTS. In choosing between alternative TTS techniques, the quality of the generated audio is a key consideration [78] [37]. Quality is determined based on the naturalness, clarity and audibility of the audio. Other important characteristics include emotions, pronunciation, dialect, timing structure, and sentence formation. Besides quality, other important considerations include the intelligibility and comprehensibility of the generated speech.

The architecture of traditional text-to-speech systems is described in Figure 4.2. These TTS systems are primarily used for speech synthesis. However, nowadays they are also used for instrumental music synthesis and singing synthesis [29]. In this thesis, the author addresses another task, the synthesis of bird audio, as will be described in subsequent chapters.


Figure 4.2: Architecture of traditional Text to Speech system

Concatenative TTS: The concatenative text-to-speech system depends on the quality of the audio, i.e. it needs high-quality recordings. These recordings are combined to create speech. Even though the audio is clear and comprehensible, it is not necessarily natural, because it is impossible to convey each word with different emotions such as stress, sadness, happiness, excitement, and boredom. The recorded audio is segmented and labeled at the phone level, then into phrases, and finally into sentences. Therefore, the system demands extensive databases and time-consuming execution, making it less robust. Concatenative TTS is also known as the unit selection algorithm. A unit can be a phone or a set of phones; if a set of phones corresponds to a word, then the unit is a word. The architecture of a concatenative text-to-speech system is shown in Figure 4.3.

Advantages of such systems are:

1. Intelligible audio with high-quality [85]

2. Helps to preserve original audio i.e. voice of a popular actor, singer, extinct animal or bird [16] [77] [57]

Disadvantages of such systems are:

1. Audio generation is a time-consuming task due to the large database [8].

2. The resulting audio may lack smoothness in comparison to the original. It may also lack emotions and naturalness [27].

For example, Acapela Group has a singing voice synthesis system that records singers' voices to preserve their heritage [21]. There is also the Vocaloid system [48], i.e. virtual singers on personal computers. Using this software, any of the available singers can be chosen to perform any chosen song with their voice, just like in concatenative systems.

Parametric TTS: This method is a statistical approach where one defines a model with parameters, approximates their values by training the model, and generates audio according to one's requirements. This helps to overcome the issues faced by concatenative TTS. Here, audio can be generated using parameters such as the fundamental frequency and the magnitude spectrum.

A Parametric TTS system has two stages:

1. The first step is to extract linguistic features such as phonemes and duration.

2. The second step is to extract vocoder features that represent the corresponding speech signal. Feature representations such as the cepstrum and the spectrogram are used for audio analysis. Here, cepstral analysis is performed to decouple the voice source and vocal tract filter components. These are therefore hand-crafted features.

Advantages of Parametric TTS are:

1. Improvement in the naturalness of the audio

2. Flexibility: easy to change the emotions

3. Lower development cost: lighter databases and less complex code

Disadvantages of Parametric TTS are:

1. Lower audio quality if the training audio is noisy, with a constant buzzing sound in the generated audio.

2. The generated audio sometimes sounds robotic due to muffled training audio.

The above-mentioned hand-crafted vocoder features and linguistic features are learned by a specific vocoder model in order to generate the waveform.

Speech synthesis systems trained parametrically are feasible and robust provided that parameters can be approximated.

A model can be trained to generate different types of audio according to the requirements. Training a parametric system requires a good amount of data, and the code is less complex in comparison to concatenative TTS. In this problem setting, the requirement is to generate bird audio for different species. Theoretically, parametric systems are perfect for this requirement, but in practice various issues make it difficult to generate natural and intelligible bird audio. These include problems like noisy audio, buzzing sounds, traffic noise, or multiple speakers talking at the same time.

In simple words, researchers hand-code the features at each and every stage of the modelling pipeline in the hope of producing natural and intelligible speech.

Figure 4.4: Flowchart of a deep learning TTS model

4.2 TTS without T

The TTS-without-T task trains a neural network on audio without any text or phonetic labels, as illustrated in Figure 4.5.

An example of such an approach is the so-called ABCD-Variational Autoencoder [60]. It uses a so-called Dirichlet-based clustering methodology. The encoder identifies the statistically most optimal frames using non-parametric learning. The ABCD-VAE represents linguistic information based on so-called phonetic posteriorgrams [38], but misses information based on the mapping between source and target speech parameters [75]. Similar issues were faced in previous years' Zero Resource Speech Challenges [26] [86], mostly in discrete representations: particular representations had a low bitrate and a higher ABX error rate in comparison to the baseline. The ABX test is a listening-based test. In this test, each listener is provided with triplets of audio files A, B and X. A and B correspond to natural audio of two different species, and X represents either a natural or a resynthesized sample of one of the species, i.e. X belongs to the same species as either A or B. This indicates a general issue in unsupervised learning of audio representations [60].
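As a small illustration of how ABX species-discrimination results of the kind reported later in this thesis are scored, the function below computes accuracy from a list of listener trials. The trial format is an assumption for illustration only.

```python
# Sketch of scoring an ABX test: each trial records which reference (A or B)
# actually shares the species of X and which one the listener chose.
def abx_accuracy(trials):
    """trials: iterable of (correct_reference, listener_choice) pairs, e.g. ('A', 'B')."""
    trials = list(trials)
    correct = sum(1 for truth, choice in trials if truth == choice)
    return 100.0 * correct / len(trials)

# Example: 3 of 4 trials answered correctly -> 75.0 %
print(abx_accuracy([("A", "A"), ("B", "A"), ("B", "B"), ("A", "A")]))
```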

Learning relevant unsupervised features from data is an essential research topic in machine learning, especially for its use in Automatic Speech Recognition (ASR). A vast difference in performance is observed between systems for resource-rich and resource-poor languages due to the dependency on ASR; thus, unsupervised machine learning approaches are needed in audio generation to reduce this gap. One approach used for this purpose is acoustic unit discovery (AUD) [68], where the focus is on identifying a set of phones or units from given unlabeled speech in a particular language. Most AUD methods use latent-space feature units deduced from the data. Then, using a Bayesian approach, the model uses a generative process such as hidden Markov models (HMM) or Gaussian mixture models (GMM) with a Dirichlet process prior over both the number of units and the


Figure 4.5: Model architecture overview for TTS without T. Here each of these three components is trained independently.


unsupervised term detection system [18].

Nowadays, Bayesian neural networks are also popular in this area. They utilize the strength of neural networks along with the self-regularizing effects of Bayesian models in a structured manner. Alternatives to the Variational Autoencoder (VAE) [68], such as the vector quantized VAE and the HMM-VAE, have also been considered. Since the 2019 ZeroSpeech challenge [25], a shift towards a synthesis-based evaluation scheme has been observed. In this methodology, audio waveform synthesis is performed using the learned units, and the quality of the synthesized waveforms is measured via the character error rate (CER) evaluated by humans, which is used as a metric to determine AUD quality. The input used to train a synthesis system is a set of waveforms and their respective labels, and the accuracy of the synthesis system is directly proportional to the accuracy of the training labels [93].

Recent research is being conducted on neural networks and intermediate discretization [13] [52], which means analyzing VQ neural networks for AUD. For ZeroSpeech 2020, two such models were proposed. One is a vector-quantized variational autoencoder (VQ-VAE) [63]. This model maps speech to a discrete latent space and then reconstructs the waveform. According to the authors, opting for a light recurrent neural network instead of WaveNet as the decoder results in a robust model that is fast and trains on a single GPU. The second model is based on VQ-wav2vec together with vector quantization and contrastive predictive coding (VQ-CPC) [5]. The authors used a contrastive loss to train the model to discriminate future acoustic units from a set of negative examples. An additional comparison of across-speaker and within-speaker sampling of the negative examples was made, and it was concluded that within-speaker sampling is the important ingredient for speaker invariance. In ABX tests on English and Indonesian data, the intermediate discretization models achieved the best results in comparison to all submissions of the ZeroSpeech 2019 and 2020 challenges. Both models are competitive with VQ-CPC and achieve the best naturalness and speaker-similarity scores on the English dataset [63].
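A minimal sketch of the vector-quantization bottleneck at the heart of such VQ-VAE/VQ-CPC style models is given below: each encoder output vector is replaced by its nearest codebook entry, whose index serves as the discrete acoustic unit. The codebook size and dimensionality are illustrative assumptions.

```python
# Sketch of a vector-quantization bottleneck: map continuous encoder outputs
# to the nearest entries of a codebook of discrete units.
import torch

codebook = torch.randn(512, 64)       # 512 discrete units, 64-dimensional embeddings

def quantize(z):
    """z: (frames, 64) encoder outputs -> (quantized vectors, unit indices)."""
    d = torch.cdist(z, codebook)      # pairwise Euclidean distances, shape (frames, 512)
    idx = d.argmin(dim=1)             # nearest codebook entry per frame
    return codebook[idx], idx         # the indices act as discrete "acoustic units"
```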


5. Methods selected for experiments

The author of this thesis is primarily interested in comparing traditional and neural vocoding techniques in the context of bird vocalizations. Therefore, the focus is on four widely adopted vocoders, namely the WORLD vocoder described in Section 2.2, and the WaveNet vocoder, WaveNet autoencoder and Parallel WaveGAN described below. These methods were selected because they are among the most recent methodologies with the best results, and they are accessible in terms of code and background literature.

A brief summary of these four vocoders can be found in Table 5.1.

5.1 Tacotron2: Wavenet vocoder

Tacotron is a generative text-to-speech model that is trained on text and audio pairs. It synthesizes audio directly from a given text. It generates speech at the frame level, which makes it considerably faster than sample-level autoregressive methods.

It comprises a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-frequency spectrograms [79]. These mel-scale spectrograms are input to a modified WaveNet model, which acts as a vocoder that synthesizes time-domain waveforms from the spectrograms. The decoder uses hyperbolic-tangent content-based attention. The waveforms are then generated using the so-called Griffin-Lim algorithm [33]. This is illustrated in Figure 5.1.
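For reference, the Griffin-Lim algorithm mentioned above can be applied to a magnitude spectrogram as in the hedged sketch below, using librosa; the file name and STFT parameters are illustrative and not taken from the thesis setup.

```python
# Sketch: recover a waveform from a magnitude spectrogram with Griffin-Lim.
import numpy as np
import librosa
import soundfile as sf

y, sr = librosa.load("bird.wav", sr=None)                    # placeholder input
S = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))      # magnitude spectrogram
y_hat = librosa.griffinlim(S, n_iter=60, hop_length=256)     # iterative phase estimation
sf.write("bird_griffinlim.wav", y_hat, sr)
```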

5.2 Wavenet Autoencoder

The author has used a WaveNet autoencoder model based on [28]. This model consists of an autoregressive decoder conditioned on temporal codes learned from the raw audio waveform. Additionally, it has an attention component that supports timing modification between the input and output samples. The attention is trained in an unsupervised way, by teaching the neural network to recover the original timing from an artificially modified one. It provides a robust TTS pipeline that can be trained without any transcript. Using the bird dataset [83], the model learns a manifold of embeddings that allows morphing between bird audio and meaningful interpolation to create new types of realistic sounds [28]. This is illustrated in Figure 5.2.

5.3 Parallel wavegan

The Parallel WaveGAN used here is a distillation-free network that is fast and has a light and compact waveform generation method. This method uses a generative adversarial network inspired by [92]. It is trained by jointly optimizing multi-resolution spectrogram and adversarial loss functions, which captures the time and frequency distribution of the waveforms effectively. Since this method does not require the density distillation used in the conventional teacher-student framework [71] [72], the model can be trained effortlessly and generates high-quality, realistic audio.

Figure 5.1: Tacotron2: Wavenet Vocoder adopted from [79]

Figure 5.2: Wavenet Autoencoder adopted from [28]


Figure 5.3: Parallel wavegan adopted from [92]

Table 5.1: Summary of vocoders

Definition. WORLD: vocoder-based synthesis system. Wavenet Vocoder: encoding of fragments of bird audio used to train the network and produce sounds. Wavenet Autoencoder: unsupervised training of the network using bird waveforms; a decoder generates bird audio. Parallel wavegan: distillation-free, fast method using a GAN.

Pros. WORLD: high quality audio. Wavenet Vocoder: can generate a "language" for birds. Wavenet Autoencoder: good quality audio; can generate audio for specific species and individuals. Parallel wavegan: good quality audio; can generate audio for specific species.

Cons. WORLD: trains only on a single audio and recreates the same training audio. Wavenet Vocoder: needs a lot of data to train; the generated audio is noisy. Wavenet Autoencoder: generated audio is low in pitch. Parallel wavegan: generated audio is low in pitch.

6. Experimental setup

The scope of this Master's thesis is to use existing generative models that learn the properties of bird audio and generate it smoothly with realistic-sounding intonations. This implementation of bird audio generation is similar to generative models of language or images. The experiments are implemented using three models, namely the WaveNet Vocoder, WaveNet Autoencoder and Parallel WaveGAN, which were chosen because of their performance on audio generation for human speech, singing data and music.

The author observed that these models also picked up other characteristics of the recordings apart from the bird song itself; for instance, they mimicked the acoustics, the recording quality and background noise such as a flowing river, wind and traffic. To avoid these issues, the author has chosen a clean and high-quality dataset.

6.1 Dataset

The xccoverbl dataset [83] is used here for the experiments. It is a file-labelled dataset where each file contains bird song of a single species only. This dataset is a subset of the available recordings from the large Xeno Canto archive (http://www.xeno-canto.org/), a website for sharing recordings of sounds of wild birds from all across the world. The xccoverbl dataset [83] was collected for common UK bird species.

The original authors of the xccoverbl dataset [83] retrieved three different recordings for each species from the Xeno Canto collection. The Xeno Canto website gives a quality rating to all recordings ranging from grade 'A' to 'E', where 'A' is the highest quality and 'E' is the lowest. These ratings are self-reported, i.e. given by the uploaders, and are also based on listener feedback. The search query made by the authors of the dataset requested high-quality recordings (quality label 'A'), and preference was given to bird songs instead of calls wherever possible.

The dataset has widely varying characteristics, for example in the typical duration of the sound files, the recording locations, and the number of classes to distinguish (Table 6.1), and there is strong overlap in the species list [83].

Note that all files in this dataset were stored in FLAC format. FLAC is a lossless compressed format; for processing purposes the files were converted into WAV format using the sox software [6].
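The thesis performs this conversion with the sox tool. An equivalent conversion in Python using the soundfile library, which reads FLAC directly, might look like the sketch below; the file names are placeholders.

```python
# Sketch: convert a FLAC recording to WAV (the thesis uses sox for this step).
import soundfile as sf

audio, sr = sf.read("xc_recording.flac")   # soundfile decodes FLAC via libsndfile
sf.write("xc_recording.wav", audio, sr)    # write the same samples as WAV
```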

Table 6.1: Dataset Description

Dataset Location Files Total Duration Mean Duration (with BAD) Classes Labelling

xccoverbl UK/Europe 264 4.9h 67s 88 Single-label


Table 6.2: Species-level training and testing details. For all the 88 species listed in the table, 3 audio files (each 30 seconds before bird activity detection, BAD) are used for training, while 2 other files (each 20 seconds before BAD) are reserved for testing purposes. The species selected for the subjective experiments are highlighted. The mean duration of one audio file is 10 seconds.

S.No Species S.No Species

1 Common Redpoll 45 Marsh Warbler

2 Sedge Warbler 46 Eurasian Reed Warbler

3 Long-tailed Tit 47 Eurasian Skylark

4 Meadow Pipit 48 Tree Pipit

5 Common Swift 49 Canada Goose

6 Dunlin 50 European Nightjar

7 European Goldfinch 51 Eurasian Treecreeper

8 European Greenfinch 52 Black-headed Gull

9 Western Jackdaw 53 Rock Dove

10 Stock Dove 54 Common Wood Pigeon

11 Northern Raven 55 Carrion Crow

12 Rook 56 Common Cuckoo

13 Eurasian Blue Tit 57 Common House Martin

14 Great Spotted Woodpecker 58 Black Woodpecker

15 Corn Bunting 59 Yellowhammer

16 Common Reed Bunting 60 European Robin

17 Common Chaffinch 61 Eurasian Coot

18 Common Snipe 62 Common Moorhen

19 Eurasian Jay 63 Red-throated Loon

20 Eurasian Oystercatcher 64 Barn Swallow

21 Eurasian Wryneck 65 Willow Ptarmigan

22 European Herring Gull 66 Common Linnet

23 River Warbler 67 Red Crossbill

24 Common Nightingale 68 European Bee-eater

25 African Pied Wagtail 69 Western Yellow Wagtail

26 Spotted Flycatcher 70 Great Tit

27 House Sparrow 71 Eurasian Tree Sparrow

28 Grey Partridge 72 Coal Tit

29 European Honey Buzzard 73 Common Pheasant

30 Common Redstart 74 Common Chiffchaff

31 Wood Warbler 75 Willow Warbler

32 Eurasian Magpie 76 European Green Woodpecker

33 European Golden Plover 77 Grey Plover

34 Willow Tit 78 Marsh Tit

35 Dunnock 79 Eurasian Bullfinch

36 Goldcrest 80 Eurasian Nuthatch

37 Eurasian Collared Dove 81 European Turtle Dove

38 Tawny Owl 82 Common Starling

39 Eurasian Blackcap 83 Garden Warbler

40 Common Whitethroat 84 Lesser Whitethroat

41 Wood Sandpiper 85 Common Redshank

42 Eurasian Wren 86 Redwing

43 Common Blackbird 87 Song Thrush

44 Northern Lapwing 88 Eurasian Golden Oriole


6.2 Data Representation Technique

The xccoverbl dataset is labelled at the file level with the bird species. This enables model training (and audio generation) to leverage the species labels. For this purpose, WaveNet encoders are used. In particular, WaveNet is used as an unsupervised representation learning approach. The aim is to learn a waveform representation and to capture high-level semantic information, reminiscent of phone classes in human speech. Since the bird species of each recording is already known but there is no 'bird language' transcript, the author relies entirely on unsupervised learning with WaveNet autoencoders to learn bird language elements.

6.3 Models

The author proposes to generate bird audio using three types of generative models, namely Tacotron 2: WaveNet Vocoder [79], WaveNet Autoencoder [28], and Parallel WaveGAN [92]. Links to the code adopted for these existing models can be found in Table 6.8. A brief description of each of the three approaches follows.

Tacotron 2: WaveNet Vocoder is a unified, location-sensitive-attention-based neural approach in which Tacotron combines text and mel spectrograms, followed by a WaveNet vocoder trained to generate bird audio. Location-sensitive attention considers the input and its position in the sequence, which helps the network not to repeat or skip syllables.

WaveNet Autoencoders perform unsupervised learning of meaningful latent representations of bird audio. The goal of these networks is to learn, in an unsupervised manner, a representation that captures high-level semantic content from the signal; in the case of speech, for instance, this may refer to different phonemic units. The network is trained, ideally, to remain invariant to confounding low-level details in the signal such as the underlying pitch contour or additive background noise.

According to the authors [92], some of the advantages of Parallel WaveGAN include distillation-free, fast, and small-footprint waveform generation. Distillation-free means that it does not require a student network to learn and replicate the probability of its samples based on the distribution learned by a teacher network. Here, a non-autoregressive WaveNet is trained by jointly optimizing a multi-resolution spectrogram loss and an adversarial loss. This helps the Parallel WaveGAN network to capture the time-frequency distribution effectively and generate realistic waveforms.

6.3.1 Tacotron2: WaveNet Vocoder

The training process is inspired by the Tacotron 2 model for speech synthesis from text [79]. For now, the author has randomly assigned text to the bird audio using the human speech text from the LJ Speech data [39]. In the LJ Speech dataset, the authors have divided the data into 50 classes (50 speakers), with 278 audio files in total, and each audio file is named with the class number followed by the file number. For example, a file named "LJ01-0010" means that this audio belongs to class 01, i.e. words spoken by the first speaker, and is the 10th recording made by that speaker. Each audio file is assigned a sentence in English. Similarly, the author has named each file in the bird audio dataset with a class number followed by a file number, and each audio file is assigned a sentence in English so that there is an understandable human-language proxy for the birds. There are 88 classes and 264 audio files in total.

First, the feature prediction network is trained with the maximum-likelihood method using a batch size of 32. Training is performed on a single GPU using the Adam optimizer [49] with the hyperparameter settings presented in Table 6.3.

Table 6.3: Hyperparameters for feature prediction network of Wavenet Vocoder

Hyperparameter Value

Batch size 32

Adam optimizer β1 = 0.9, β2 = 0.99

Learning rate 10^-3

L2 regularization 10^-6

The output of this feature prediction network is frame-level, ground-truth-aligned predictions. The WaveNet vocoder is then trained in teacher-forcing mode on these frame-level predictions, which ensures that each predicted frame is aligned with the target waveform samples. The hyperparameter settings used for this WaveNet training stage are presented in Table 6.4.


Table 6.4: Hyperparameters for training the Wavenet Vocoder

Hyperparameter Value

Batch size 64

Adam optimizer β1 = 0.9, β2 = 0.99, ε = 10^-8

Learning rate 10^-4

6.3.2 WaveNet Autoencoder

The author of this thesis has adopted the WaveNet autoencoder model [28], which learns temporal encodings of the audio data to perform neural synthesis. This approach thus removes the need for conditioning on external features. The temporal encoder has the same dilation blocks as WaveNet, but its convolution is non-causal, i.e. it considers the entire recording for a given input chunk. The model consists of thirty computational layers followed by an average pooling layer that creates a temporal embedding of 16 dimensions for every 512 samples. A vanilla WaveNet decoder with 30 layers, where each layer is a 1x1 convolution with a bias, is used to upsample the embedding back to the original time resolution, with the hyperparameter settings presented in Table 6.6. The model is trained synchronously for 100k iterations with a batch size of 32 and the hyperparameter settings presented in Table 6.5; a minimal code sketch of the encoder structure is given after Table 6.6. The training and synthesis code used for the WaveNet autoencoder is referenced in Appendix 10.

Table 6.5: Hyperparameters for training the Wavenet Autoencoder

Hyperparameter Value

Batch size 32

Avg pooling layer temporal embedding of 16 dimensions

Samples per embedding 512

Iterations 100K

Computational layers 30

Table 6.6: Hyperparameters for the vanilla WaveNet decoder

Hyperparameter Value

layers 30

convolution 1x1

bias True
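The following simplified PyTorch sketch illustrates the overall shape of the temporal encoder described above and in Tables 6.5 and 6.6: a stack of non-causal ("same"-padded) dilated convolutions followed by average pooling into one 16-dimensional embedding per 512 samples. The channel width, dilation schedule details and class name are illustrative assumptions, and this is not the exact implementation linked in Table 6.8.

import torch
import torch.nn as nn

class TemporalEncoder(nn.Module):
    def __init__(self, channels=128, n_layers=30, emb_dim=16, hop=512):
        super().__init__()
        self.inp = nn.Conv1d(1, channels, kernel_size=1)
        layers, dilation = [], 1
        for _ in range(n_layers):
            # "Same"-padded (non-causal) dilated convolution: each output sample
            # sees context on both sides of the input chunk.
            layers.append(nn.Conv1d(channels, channels, kernel_size=3,
                                    dilation=dilation, padding=dilation))
            dilation = dilation * 2 if dilation < 512 else 1   # 3 cycles of exponentially growing dilation
        self.dilated = nn.ModuleList(layers)
        self.bottleneck = nn.Conv1d(channels, emb_dim, kernel_size=1)
        self.pool = nn.AvgPool1d(kernel_size=hop)              # one embedding per 512 samples

    def forward(self, x):                  # x: (batch, 1, n_samples)
        h = self.inp(x)
        for conv in self.dilated:
            h = h + torch.relu(conv(h))    # residual connection around each dilated layer
        return self.pool(self.bottleneck(h))   # (batch, 16, n_samples // 512)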

6.3.3 Parallel wavegan

The Parallel WaveGAN setup is adopted from the existing Parallel WaveGAN model trained with a multi-resolution spectrogram loss [92]. The generator consists of 30 dilated convolutional layers arranged in three exponentially increasing dilation cycles, with 64 residual and skip channels and a filter size of 3. The discriminator consists of 10 non-causal dilated 1-D convolutions with leaky ReLU activation (α = 0.2). Linearly increasing dilations in the range of one to eight, with a stride of 1, are applied to these 1-D convolutions except for the first and last layers; again, 64 residual and skip channels with a filter size of 3 are used. Weight normalization is applied to all the convolutional layers of both the generator and the discriminator [76].
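A minimal PyTorch sketch of the discriminator layout just described (ten weight-normalised, non-causal dilated 1-D convolutions with leaky ReLU, α = 0.2, and linearly increasing dilations in the intermediate layers) is given below; the channel sizes, the single-channel output and the dilation values of the first and last layers are assumptions rather than the verified configuration.

import torch.nn as nn
from torch.nn.utils import weight_norm

def pwg_discriminator(channels=64, n_layers=10, kernel_size=3):
    layers, in_ch = [], 1
    for i in range(n_layers):
        last = (i == n_layers - 1)
        dilation = 1 if (i == 0 or last) else i          # middle layers: dilations 1..8
        out_ch = 1 if last else channels                 # final layer emits a scalar score per step
        layers.append(weight_norm(nn.Conv1d(in_ch, out_ch, kernel_size,
                                            dilation=dilation,
                                            padding=dilation * (kernel_size - 1) // 2)))
        if not last:
            layers.append(nn.LeakyReLU(0.2))
        in_ch = out_ch
    return nn.Sequential(*layers)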

The models were trained for 100K steps using the RAdam optimizer [54] with the hyperparameter settings presented in Table 6.7. The discriminator was fixed for the first 50K steps, after which both models were trained jointly. The minibatch size was set to eight, and each training audio clip contained 24K time samples, i.e. each clip was 1.0 second long. The initial learning rate for the generator was 0.0001.


Table 6.7: Hyperparameters for training the parallel wavegan

Hyperparameter Value

Dilation cycles (exponentially increasing) 3

Residual and skip channels (generator) 64

Filter size 3

Non-causal dilated 1-D convolutions (discriminator) 10

Leaky ReLU activation α = 0.2

Residual and skip channels (discriminator) 64

Iterations 100K

RAdam optimizer (first 50K steps) ε = 1e-6

Computational layers 30

Minibatch size 8

Time samples per clip 24K

Table 6.8: Models and their codes

Model Code

WORLD https://github.com/JeremyCCHsu/Python-Wrapper-for-World-Vocoder

WaveNet Autoencoder https://github.com/magenta/magenta/tree/master/magenta/models/nsynth

Parallel wavegan https://github.com/kan-bayashi/ParallelWaveGAN

Wavenet Vocoder https://github.com/r9y9/wavenetvocoder

6.4 Evaluation Metrics

In this work the output of the different vocoders is evaluated both objectively and subjectively. The objective evaluation analyses the results using quantifiable measures, whereas the subjective evaluation analyses the results based on feedback from listeners. Both are described in detail in the following subsections.

6.4.1 Objective Evaluation

Root mean square error (RMSE) [4] is an objective measure used in synthetic audio evaluation. Here it measures the distance between the mel cepstra of two recordings: the lower the RMSE between the natural and resynthesized audio, the higher the quality [36].

For objective evaluation of the synthesized audio, the mel cepstral distortion (MCD) error is calculated using Equation (6.1):

\mathrm{MCD} = \sqrt{\frac{1}{T}\sum_{t} 2\,\lVert mc(t,i) - mc_{synth}(t,i)\rVert^{2}}, \quad (6.1)

where mc represents the mel cepstrum of the original audio, mc_synth represents the mel cepstrum of the resynthesized audio, T is the number of frames, and t is the time step (frame) index for one audio file.

MCD is used for assessing the quality of parametric speech synthesis systems, including statistical parametric speech synthesis systems, the idea being that the smaller the MCD between the synthesized and natural mel cepstral sequences, the closer the synthetic speech is to natural speech. It is by no means a perfect metric for assessing the quality of synthetic speech, but it is often a useful indicator in conjunction with other metrics [36].
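Assuming the mel cepstra of the natural and resynthesized recordings have already been extracted and frame-aligned, Equation (6.1) can be computed with a few lines of NumPy; the sketch below illustrates that computation and is not the exact evaluation script used in this work.

import numpy as np

def mel_cepstral_distortion(mc, mc_synth):
    # mc, mc_synth: arrays of shape (T, D) holding T frames of D mel cepstral coefficients.
    assert mc.shape == mc_synth.shape
    # Squared Euclidean distance per frame, scaled by 2 and averaged over the T frames
    # before taking the square root, as in Eq. (6.1).
    return np.sqrt(np.mean(2.0 * np.sum((mc - mc_synth) ** 2, axis=1)))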

6.4.2 Subjective Evaluation


The subjective evaluation consists of two parts: an ABX species discrimination test and a mean opinion score (MOS) quality rating. In practice, the author gathered the responses to both simultaneously. An ABX test is a method for comparing given choices using sensory stimuli. In this test, the participant is presented with two known samples, A and B, and an unknown sample X that belongs to the same class as either A or B; the participant has to identify the class of X as either A or B.

In the ABX test, each listener is presented with triplets (trials) of audio files. A and B correspond to natural audio of two different species, and X is either a natural or a resynthesized sample of one of the species. The listeners are asked to choose whether X resembles A or B more. The author prepared a total of 10 trials (ABX triplets) for each of the three vocoders; together with the natural audio this gives a total of 40 listening trials per subject. This was deemed a suitable compromise between gathering enough results and avoiding listening fatigue. Each audio clip is 10 seconds long, so the 40 trials amount to about 400 seconds of audio; listeners may replay a trial if unsure, and there are additional questions about the playback device used, the listener's background and any difficulties encountered, so the test takes around 10 to 12 minutes per listener. Increasing the number of audio samples would lengthen the test, make it harder for participants to concentrate, and thereby reduce the reliability of the results. The A and B samples are selected from the training set used to train the neural vocoders, and X is always selected from a different audio file corresponding to either A or B; this way, the subject cannot make a trivial 'text-dependent' comparison of the original versus the resynthesized file, but has to pay attention to the general properties of the two species. The ABX results are summarized as the percent-correct identification rate broken down by vocoder (Natural, WORLD, WaveNet autoencoder, Parallel WaveGAN).
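As a sketch of this summarisation step, the snippet below computes the percent-correct rate per vocoder from a list of (vocoder, is_correct) response pairs; the response record format is an assumption for illustration only.

from collections import defaultdict

def percent_correct(responses):
    # responses: iterable of (vocoder_name, is_correct) pairs pooled over all listeners and trials.
    totals, correct = defaultdict(int), defaultdict(int)
    for vocoder, ok in responses:
        totals[vocoder] += 1
        correct[vocoder] += int(ok)
    return {v: 100.0 * correct[v] / totals[v] for v in totals}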

To reach some degree of statistical confidence, the number of trials must be sufficiently large. If the audio samples X, A and B are very similar, the listener may choose A or B essentially at random; in that case the probability of choosing "X = A" or "X = B" is 0.5, and even a correct answer proves nothing on its own. To achieve the desired degree of confidence, each listener is therefore given multiple trials in random order, so that it can be determined whether the correct answers are statistically significant, i.e. at the 95% confidence level [61].
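One practical way to check that the correct answers exceed the 50% guessing level is a one-sided binomial test, sketched below with SciPy (assuming SciPy 1.7 or newer, where binomtest is available); the trial counts used here are placeholders, not results from this study.

from scipy.stats import binomtest

n_trials, n_correct = 30, 22                       # placeholder counts for one condition
test = binomtest(n_correct, n_trials, p=0.5, alternative="greater")
print(f"p-value = {test.pvalue:.4f}")              # p < 0.05 -> accuracy is above chance at the 95% level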

The quality evaluation is based on a 5-point mean opinion score (MOS) rating. In practice, the subjects do both the ABX and the quality rating at the same time: they are asked to rate the X sample on the scale [1...5], where 1 means that X does not resemble bird sounds at all and 5 means that X resembles a bird sound perfectly. In preparing the listening test, the order of the trials, as well as the order of the A and B samples, is randomized, with a different order for each subject. The duration of each of the 40 samples was fixed at 10 seconds. The audio files are additionally normalized in their energy using the ffmpeg tool [74]. FFmpeg stands for Fast Forward MPEG (Motion Picture Experts Group); it is a command-line tool used to convert audio and video into a desired format, join audio or video files, or extract a specific component from a given audio or video file.

The author recruited a total of 17 subjects. In practice, the samples were presented through PHP-based web forms. Each subject was free to listen to the samples as many times as needed, at their own pace. As the author collected some potentially identifying metadata (e.g. information on hearing losses and languages spoken), the data protection officer of the University of Eastern Finland was consulted on best practices. All subjects took part voluntarily and were informed about the aims of the study following standard consent form templates. No compensation was provided. Screenshots of the listening test form are shown in the appendix (Figures 10.1 and 10.2).


7. Experimental Results

The results for the evaluation of each model over bird audio data are discussed in an objective and subjective manner in the following subsections.

7.1 Objective Evaluation Results

The RMSE evaluations of the three vocoders in Table 7.1 indicate the lowest value for the WORLD vocoder and the highest for the WaveNet autoencoder, with Parallel WaveGAN between the two. These results were obtained over the same 10 audio files per method as presented to the listeners in the next subsection (underlined bird species in the dataset Table 6.2). The 95% confidence range is obtained from the standard error of the mean as

1.96 \cdot \frac{\sqrt{\sigma^{2}}}{\sqrt{n}}, \quad (7.1)

where n = 10 is the number of test files and \sigma^{2} is the variance, calculated as

\sigma^{2} = \frac{1}{n}\sum_{i=1}^{n}(x_{i} - \mu)^{2}.
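The confidence ranges reported in Table 7.1 follow directly from Equation (7.1); the NumPy sketch below shows the computation with placeholder per-file RMSE values, since the actual per-file scores are not listed here.

import numpy as np

rmse_values = np.array([0.61, 0.72, 0.55, 0.80, 0.66, 0.70, 0.74, 0.63, 0.69, 0.78])  # placeholders
n = len(rmse_values)                                         # number of test files, here n = 10
mean = rmse_values.mean()
ci_range = 1.96 * np.sqrt(rmse_values.var()) / np.sqrt(n)    # Eq. (7.1), population variance
print(f"{mean:.4f} +/- {ci_range:.4f}")                      # reported as mean ± 95% range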

Table 7.1: Average root mean square error (RMSE) along with 95% confidence range from standard error of mean (SEM).

Model RMSE

WORLD 0.6879±0.41

WaveNet Autoencoder 3.2047±1.20

Parallel wavegan 1.82647±0.67

7.2 Subjective Evaluation Results

Table 7.2 reports each model together with its mean identification accuracy, i.e. how accurately the listeners were able to identify the class of the presented sample X. Table 7.3 reports the mean opinion score for each model, which reflects the quality of the bird audio it generated: for each X, whether generated or natural, the listener gave a score from 1 to 5 based on how much it resembled a bird song, 1 being the lowest score (does not resemble a bird song at all) and 5 the highest (highly resembles a bird song). The author had an initial pool of 17 subjects, but two subjects were excluded because they gave only 2 and 3 correct responses, respectively, in the ABX test on the 10 natural samples. The results summarized below correspond to the responses of the remaining 15 subjects, all of whom obtained at least 5/10 correct on the natural samples.

7.2.1 Subjective results: species discrimination (ABX)


Table 7.2: Models and mean listener identification accuracy

Model mean accuracy (%)

WORLD 71.17

WaveNet Autoencoder 71.76

Parallel wavegan 70.58

Natural 73.53

Table 7.3: Models and mean opinion score along with 95% confidence range from standard error of mean (SEM).

Model mean opinion score

WORLD 3.81±0.43

WaveNet Autoencoder 3.13±0.65

Parallel wavegan 2.65±0.57

Natural 3.96±0.48

The main observations from these results are the following:

• natural bird samples are classified more accurately than the resynthesized samples;

• subjectively no significant differences are observed between the vocoders.

Since the identification rates in Table 7.2 are all clearly above the 50% chance level, the results suggest that naive listeners are, on average, able to identify X more accurately than by guessing. The statistically significant gap observed in mean identification accuracy between the natural and resynthesized samples, however, suggests that vocoding suppresses some species-specific information that is relevant for listeners. The differences in the objective RMSE results indicate that more training data and/or fine-tuning of the model parameters is required to obtain more "bird-like" generated patterns. Finally, even the natural samples are not classified perfectly in the listening test. This might be because the author did not include a familiarization (training) phase for the listeners, and because most of the subjects in this experiment are non-experts.

Table 7.4: Listener-specific accuracies

Subject Acc. % (natural) Acc. % ( all vocoders)

id1 90.00 76.67

id2 100.00 83.33

id3 100.00 93.33

id4 90.00 60.00

id5 80.00 73.33

id6 60.00 50.00

id7 50.00 50.00

id8 60.00 70.00

id9 90.00 80.00

id10 60.00 40.00

id11 90.00 90.00

id12 90.00 90.00

id13 70.00 76.67

id14 80.00 83.33

id15 90.00 86.67

Table 7.4 breaks down the accuracy per listener. Even though the overall results for the different vocoders are rather similar, the accuracy varies considerably from listener to listener.
