
Generating speech from model trained to classify speech utterances

Bilal Soomro

Master’s thesis

School of Computing
Computer Science

11th June 2019


UNIVERSITY OF EASTERN FINLAND, Faculty of Science and Forestry, Joensuu
School of Computing, Computer Science

Soomro, Bilal: Generating speech from model trained to classify speech utterances
Master's thesis, 56 p.

Supervisors: Ville Hautamaki and Anssi Kanervisto
11th June 2019

Abstract: Deep neural networks (DNN) are able to successfully process and classify speech utterances. However, understanding the reason behind a correct or incorrect classification by a DNN is difficult, and it is unclear what features they pick up during learning. One debugging method used with image classification DNNs is activation maximization, which generates example images that are classified as one of the classes.

In this thesis, we evaluate the applicability of this method to speech utterance classifiers as a means of understanding what a DNN "listens to". We train a classifier using the speech commands corpus and then use activation maximization to sample from the trained model. We then synthesize audio from the features using a WaveNet vocoder for subjective analysis. We measure the quality of the generated samples by objective and crowd-sourced human evaluations. Results show that when combined with a prior of natural speech, activation maximization can be used to generate examples of different classes. Based on these results, activation maximization can be used to start opening up the DNN black box in speech tasks.

Keywords: deep neural networks, speech recognition, convolutional neural network, supervised learning, autoencoders

ACM CCS (2012)

• Computing methodologies → Speech recognition; Neural networks; Artificial intelligence; Supervised learning by classification


Acknowledgments

I wish to acknowledge my supervisor Ville Hautamäki for fostering my interest in speech recognition and giving me an interesting topic to work on. I learned quite a bit about speech recognition while working on this thesis, and it wouldn't have been possible without the help of my supervisor, who always guided me along the way. I would also like to acknowledge my second supervisor, Anssi Kanervisto, who helped me countless times throughout the duration of my thesis. I wouldn't have been able to progress as far as I did without his guidance. And I would also like to thank my family, who were always supportive of me in times when I told them I am trying my best.

I also want to thank the School of Computing for giving me a workspace in the speech lab for my thesis work as it enabled me to work closely with the speech team. This thesis was financially supported by the Academy of Finland (grant #313970).


Contents

1 Introduction

2 Neural Networks
   2.1 Single layer perceptron
   2.2 Multilayer perceptron
   2.3 Activation functions
   2.4 Estimating neural network parameters
   2.5 Convolutional neural networks
   2.6 Autoencoders

3 Speech Recognition
   3.1 Basic speech recognition pipeline
   3.2 Hidden Markov model
   3.3 Architecture of HMM-based recognizer
   3.4 Isolated word recognition

4 Sampling from a trained classifier
   4.1 Activation maximization
   4.2 Generating classifier dependent noise
   4.3 Using deep generator network

5 Experimental Setup
   5.1 Dataset
   5.2 Training deep learning models
   5.3 Perceptual experiments

6 Results
   6.1 Training accuracy
   6.2 Decoder quality
   6.3 Objective evaluations with separate classifier
   6.4 Perceptual evaluations

7 Conclusions

References


1. Introduction

Deep neural networks (DNN) [1] have produced dramatic improvements over previous baselines in machine learning tasks, driven by a combination of increased computing power, huge datasets and algorithmic tweaks [1]. DNNs are also widely used in speech applications, as they have been shown to give state-of-the-art results in the respective tasks [2, 3, 4]. Voice assistants such as Google's Assistant [5], Apple's Siri [6], Microsoft's Cortana [7] and Amazon's Alexa [8] are a result of the advancement in speech recognition and have slowly become an integral part of our lives, as they are embedded in our smartphones and smart speakers [9]. Other applications of speech recognition systems include transcribing speech with systems that convert speech to text [10] or produce artificial speech [11], assistive technologies to enhance learning [12], and speaker identification and verification systems [13].

Although DNNs have been shown to perform exceptionally well in classification tasks, it has proven difficult to peek inside the black box to know why DNNs are able to learn and perform so well [14]. Several methods have been proposed to open the black box of DNNs in image classification tasks, such as deep visualizations [15]. Activation maximization [16] and code inversion [17] techniques have been used in previous works to visualize DNNs. When DNNs learn to discriminate between different classes, they also store that information, and it can be extracted and visualized to see what the network has learned [18]. Thus, several researchers have tried to visualize deep convolutional neural networks (CNN) [19] to understand what they have learned [20], [16], [21]. Prior work on visualization of DNNs has shown that it is possible to extract features from neurons in the DNN layers using various techniques. When DNNs are visualized, it can be seen that they typically learn lower and middle level features and that they do not pay much attention to the overall structure.

Therefore, the visualizations are not natural, and some visualization methods generate images which are barely recognizable [20].

Nguyen et al. [22] used the activation maximization technique to visualize what a CNN has learned. This technique takes a trained DNN and produces synthetic images to


activate a specific neuron [16]. They show that previous works on visualizing CNNs do not produce realistic images of the classes the network has learned and do not help researchers understand what features the neural network has picked up when learning to classify images. They demonstrated that harnessing a learned prior enabled them to produce more realistic images from the CNN, which can help understand and debug what the network has learned in a more interpretable way.

This thesis aims to use the activation maximization technique to understand the inner workings of a speech classifier. We want to investigate whether it is possible to hear what the speech classifier "listens to" by synthesizing speech features extracted from a trained speech classifier. We perform experiments in which we use activation maximization to generate speech features which activate the target neurons in our speech classifier DNN, and we evaluate those speech features with a separate speech classifier to see if the maximized samples can fool another similarly trained classifier. We also perform perceptual experiments to evaluate the quality of the synthesized samples from our experiments. The work done in this thesis can be used to debug speech classifiers and understand which features a speech classifier has learned to pick up during its training.

The thesis first introduces the basics of neural networks and deep learning in Chapter 2.

Chapter 3 covers speech recognition and the task of this thesis. Chapter 4 discusses the proposed method which we use to generate samples from a trained classifier. Chapter 5 covers the experimental setup and the parameters of our models. Chapter 6 discusses the results and findings of our experiments. And finally, Chapter 7 gives our conclusion on the work done in this thesis.


2. Neural Networks

Humans have the natural ability to understand and identify complex objects and surroundings using their brain. We can train ourselves to automatically process and recognize things using our senses, and with this training our brains can perform these tasks easily. Because our brains can identify objects so easily, they have long been a subject of study by researchers trying to understand how they make these distinctions on the fly.

In biological neural networks, each neuron is connected to thousands of other neurons.

Each neuron receives signals that go to the neuron's cell body. The signals are then typically integrated or summed together to produce a new signal, and depending on a threshold, this response determines whether the neuron is activated or "fires". The response is then sent to other connected neurons via the branching fibres of the axon [23].

The study of biological neural networks inspired and led to the creation of artificial neural networks (ANNs). In an ANN, the inputs are multiplied by weights (analogous to signal strengths) and passed to hidden neurons. The resulting output gives the activation of the neuron, which indicates whether the neuron fires or not.

2.1 Single layer perceptron

A single layer perceptron (SLP) [24] is the simplest kind of ANN, with the ability to perform binary classification. Figure 2.1 shows a general model of an SLP based on the model of McCulloch and Pitts [23]. There are two stages involved in the general form of ANNs. The first stage involves calculating a linear combination of the inputs.

Typically, each input has a corresponding weight attached which takes a value between 0 and 1. A bias θ is also added to the function, with a corresponding input fixed to 1. The summation function is given as:

x = \sum_{i=1}^{D} x_i w_i + \theta,    (2.1)



Figure 2.1: A single layer perceptron model.

where x_i represents the inputs, w_i represents the weights and θ is the bias of the neural network.

The second stage of ANNs is to perform an activation to get the output of the network.

Activation functions introduce non-linearity by squashing the output into a certain range depending on the chosen activation function. One common activation function is the sigmoid function. As the input x approaches a large positive value, the output y approaches 1, and as the input x approaches a large negative value, the output y approaches 0. The function is represented by the following mathematical expression [25]:

y = \frac{1}{1 + \exp(-x)}    (2.2)

The SLP is able to classify the input data correctly only if it is linearly separable. It tries to fit a hyperplane between the classes. If the data is not linearly separable, SLPs are not able to find a solution in which all cases are classified correctly [26]. A famous example of this problem is the boolean exclusive-or (XOR) problem.
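As an illustration, the two stages above can be written out in a few lines of code. The following is a minimal sketch (not from the thesis) of the SLP forward pass of equations (2.1) and (2.2), with made-up weights and bias:

```python
import numpy as np

def slp_forward(x, w, theta):
    """Single layer perceptron: weighted sum (eq. 2.1) followed by a sigmoid (eq. 2.2)."""
    v = np.dot(w, x) + theta          # linear combination of inputs plus bias
    return 1.0 / (1.0 + np.exp(-v))   # sigmoid squashes the output into (0, 1)

# Toy example with D = 3 inputs and hypothetical weights and bias.
x = np.array([0.5, -1.2, 0.3])
w = np.array([0.4, 0.7, 0.1])
print(slp_forward(x, w, theta=0.2))   # a value between 0 and 1
```

An input would then be assigned to one class or the other depending on whether this output is above or below 0.5.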

2.2 Multilayer perceptron

Multilayer perceptrons (MLPs) [26] are feedforward ANNs that can solve complex problems like the XOR problem. An MLP is like an SLP except that it contains one or more hidden layers. MLPs can also use nonlinear activation functions, which enable them to solve non-linear problems that were previously not possible with SLPs. Figure 2.2 shows the basic structure of an MLP. The information from the input neurons is passed layer by layer up until the output layer.



Figure 2.2: Structure of a multilayer perceptron.

MLPs are often applied to supervised learning problems. Supervised learning tasks involve learning a function that maps an input to an output. The learning process involves adjusting the network parameters (weights and biases) to reduce the error of the network's prediction. Once trained, the network can infer a prediction on unseen input data. Examples of such networks include image classifiers that take an image and then output its class label.

There are two stages in the supervised learning process. The first stage is known as the forward pass: the input is passed through the layers of the network and the output is compared to the desired output. To measure how the neural network performed, we use a loss function and calculate the error. In the second stage, the backward pass is performed.

This uses the back-propagation algorithm [27] to calculate the partial derivatives of the error with respect to the network parameters. A gradient based optimization algorithm is then used to update the network parameters so as to reduce the error in the loss function.

2.3 Activation functions

Activation functions introduce non-linearity and determine whether the output is activated or not. The type of activation function needed typically depends on the requirements of the neural network and the classification task. Activation functions can be either linear or non-linear. An example of using a linear activation function (Figure 2.3a) is finding the best line of fit which splits the points in a graph; it is commonly used for binary classification. Linear activation functions are not used in deep neural networks (DNNs) because they result in a shallow network. Nonlinear activation functions enable the model to generalize to more varied data and enable the network to classify a wider range of outputs.


Figure 2.3: Plots of activation functions [28]: (a) linear, (b) sigmoid, (c) tanh, (d) ReLU.

We will go through some common nonlinear activation functions which have been used in DNNs in recent years. The sigmoid function is one of the most commonly used activation functions. It produces an S-shaped curve and limits the output between 0 and 1. It is typically used where the output of the model should be interpreted as a posterior probability [29]. Although it is widely used in DNNs, it does have an issue known as the vanishing gradient [1]. This refers to situations where the gradient becomes too small, causing the network to stop learning and making training slower.

Figure 2.3b shows what the curve looks like. The mathematical form of the function is given as:

f(x) = \sigma(x) = \frac{1}{1 + e^{-x}}    (2.3)

Hyperbolic tangent (tanh) is another common activation function. Tanh values range between −1 and 1, which makes the gradient of the curve a little stronger. However, it also saturates when the value of x is either too high or too low, which makes it vulnerable to the vanishing gradient problem [1]. Figure 2.3c shows the tanh S-shaped curve. The mathematical form of the function is given by:

f(x) = 2\sigma(2x) - 1.    (2.4)

Rectified linear unit [30, 31] (ReLU) is the recommended activation function in modern neural networks [1]. ReLU limits the output to values between 0 and infinity. The advantage it provides is that the gradients are large whenever the units are active [1]. And when the units are not active, it outputs zero, which makes it easy to train. Figure 2.3d shows


the curve of the function, and its mathematical form is given below:

f(x) = \max(0, x).    (2.5)

One disadvantage of ReLU is that if too many neurons output 0 after their activation, the network will be limited in its capacity. One generalization of ReLU which solves this issue is leaky ReLU [32]. It changes the slope to 0.01 for negative inputs, which causes a small "leak" and extends the range of ReLU.
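The activation functions discussed above are simple elementwise operations. The sketch below (illustrative only, not taken from the thesis) implements equations (2.3)-(2.5) plus leaky ReLU with NumPy:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))        # eq. (2.3), output in (0, 1)

def tanh(x):
    return 2.0 * sigmoid(2.0 * x) - 1.0    # eq. (2.4), output in (-1, 1)

def relu(x):
    return np.maximum(0.0, x)              # eq. (2.5), zero for negative inputs

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)   # small slope instead of a hard zero

x = np.linspace(-3, 3, 7)
for fn in (sigmoid, tanh, relu, leaky_relu):
    print(fn.__name__, fn(x))
```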

2.4 Estimating neural network parameters

2.4.1 Loss functions

When we want to estimate the neural network parameters, we need some way to compare how changes in the parameters affect the performance of the neural network. This is where loss functions come in. As DNNs approximate functions, the goal of using loss functions is to see how well the neural network approximates.

Loss functions give us a measure of how the neural network has performed. When we are training neural networks, we try to minimize this loss function so that the actual output is as close as possible to the desired output.

The mean squared error (MSE) function is a measure of the average of the squared differences between the target and the predicted outputs. It is typically used as a loss function in regression problems. The result of this loss function is always positive even if the target or predicted output is negative. If the MSE loss function gives a nonzero value, learning can take place by adjusting the parameters (weights and biases) of the network until the network output y is close to the target output t. The mathematical equation of MSE is as follows:

E_{MSE} = \frac{1}{2}(t - y)^2,    (2.6)

where t is the target output and y is the output of the network.

One important requirement of the training algorithm of neural networks is that the loss function has to be differentiable. MSE is not used in DNNs because it does not produce good results when gradient-based optimization techniques are used in the learning algorithm [1]. Most modern neural networks use the cross-entropy [29] loss function, which fits classification tasks well. This loss function solves the issue of small gradients


that arise with other loss functions such as MSE [1]. The specific type of cross-entropy loss function depends on the classification task, i.e. binary or multi-class. The binary cross-entropy [29] loss function is used for binary classification tasks and is given by [29]:

E_{BCE} = -\sum_{i=1}^{N} \big( t_i \log(y_i) + (1 - t_i) \log(1 - y_i) \big),    (2.7)

where N represents the number of training samples, and t_i and y_i represent the target and the predicted output for the i-th sample. The categorical cross-entropy [29] loss function for multi-class classification tasks is given by [29]:

E_{CCE} = -\sum_{i=1}^{N} \sum_{c=1}^{M} t_{ic} \log(y_{ic}),    (2.8)

where N represents the number of samples, M represents the number of class labels, and t_{ic} and y_{ic} represent the target and the predicted output for sample i belonging to class c.
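The three loss functions above can be computed directly from their definitions. The following sketch is an illustration under the notation of equations (2.6)-(2.8), not code from the thesis; it assumes targets given as one-hot vectors and predictions given as probabilities:

```python
import numpy as np

def mse(t, y):
    """Mean squared error with the 1/2 factor of eq. (2.6)."""
    return 0.5 * np.sum((t - y) ** 2)

def binary_cross_entropy(t, y, eps=1e-12):
    """Binary cross-entropy, eq. (2.7); eps avoids log(0)."""
    y = np.clip(y, eps, 1.0 - eps)
    return -np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))

def categorical_cross_entropy(t, y, eps=1e-12):
    """Categorical cross-entropy, eq. (2.8); t and y have shape (N, M)."""
    y = np.clip(y, eps, 1.0)
    return -np.sum(t * np.log(y))

t = np.array([[1, 0, 0], [0, 1, 0]])              # one-hot targets
y = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])  # predicted class probabilities
print(mse(t, y), binary_cross_entropy(t, y), categorical_cross_entropy(t, y))
```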

2.4.2 Gradient descent

To minimize the loss function, we can use gradient descent [29]. Gradient descent uses the gradients of the loss function to find the direction which leads towards a smaller loss and then takes steps towards it. Because neural networks have multiple parameters θ, we have to calculate the gradients with respect to all the parameters, which can be done by obtaining partial derivatives [1]. The gradient descent parameter optimization can be described as follows [1]:

\theta_{t+1} = \theta_t - \epsilon \nabla_{\theta} f(\theta),    (2.9)

where θ refers to the parameters, ε refers to the learning rate and \nabla_{\theta} f(\theta) refers to the partial derivatives of the loss function with respect to the parameters θ.
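The update of equation (2.9) can be applied to any differentiable function whose gradient we can evaluate. A minimal sketch with a made-up toy loss (not from the thesis):

```python
import numpy as np

def gradient_descent(grad_fn, theta0, lr=0.1, steps=100):
    """Repeatedly apply eq. (2.9): theta <- theta - lr * gradient."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - lr * grad_fn(theta)
    return theta

# Toy loss f(theta) = ||theta - 3||^2 with gradient 2 * (theta - 3);
# the minimum is at theta = 3 for every component.
print(gradient_descent(lambda th: 2.0 * (th - 3.0), theta0=[0.0, 10.0]))
```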

2.4.3 Back-propagation

As DNNs typically contain multiple hidden layers, calculating the gradients requires a special technique called back-propagation [27]. It is used to estimate neural network parameters with gradient descent, i.e. to determine the changes in weights and biases in the neural network that lead to minimization of the loss function.


In feedforward neural networks, the input X goes through the hidden layers to produce an output ŷ. The output then produces a scalar loss. This process is called forward propagation. During learning, the loss goes backwards through the neural network to calculate the gradients. This calculation of the gradients for each neuron within the network is called backward propagation. The gradient descent algorithm is then used to optimize the parameters of the neural network.

We will now describe the derivation of the back-propagation algorithm. Suppose we first initialize a neural network with random parameters (weights and biases). We then propagate forward to calculate the output of the network. The output of each node is calculated by taking the weighted sum of its inputs and then applying an activation function.

The weighted sum is given as [33]:

v_j = \sum_{k \in K_j} w_{kj} x_{kj},    (2.10)

where K_j are the nodes connecting to node j from layer k. The activation is [33]:

y_j = f(v_j),    (2.11)

which in this derivation of back-propagation will be the sigmoid, given as [33]:

f(z) = \frac{1}{1 + e^{-z}}.    (2.12)

If we take the derivative of the sigmoid function, we get [33]:

\frac{d f(v_j)}{d v_j} = f(v_j)\big(1 - f(v_j)\big).    (2.13)

Once we have the output of the network, we need to compare it with the desired output by defining an overall measure of error. If the error for each output node is calculated as the difference between the target and the actual output, the overall error can be calculated as the sum of all these errors [33]. If we square the terms of this sum and divide by 2, we get the MSE loss function [33]:

E = \frac{1}{2} \sum_{j} (t_j - y_j)^2.    (2.14)

When we are trying to figure out how much the weight of a node should change, we must look at how much it influences the error. The bigger the influence it has on the error, the bigger the reduction in the error when that weight is adjusted. Imagine the error (a function in the weight space) as a hill where the dimensions of the hill are


the weights and the height of the hill is the error. The effect that a weight has on the neural network error can then be seen from the steepness of this hill [33]: the steeper it is at that specific weight, the larger the change in that weight.

To calculate the steepness in the direction of the weight, we must calculate the derivative of the error with respect to that weight. If we have a node with its weight going from layer k to layer j, the change of the weight is defined as [33]:

\Delta w_{kj} = \frac{\partial E}{\partial w_{kj}}.    (2.15)

The chain rule then gives us the partial derivative as [33]:

\frac{\partial E}{\partial w_{kj}} = \frac{\partial E}{\partial y_j} \frac{\partial y_j}{\partial v_j} \frac{\partial v_j}{\partial w_{kj}},    (2.16)

where v_j is the weighted sum of the inputs to the j-th node according to equation (2.10).

So we can infer that [33]:

\frac{\partial v_j}{\partial w_{kj}} = y_k.    (2.17)

Let us treat the first two factors as one error term [33]:

\delta_j := -\frac{\partial E}{\partial y_j} \frac{\partial y_j}{\partial v_j}.    (2.18)

We can calculate this error term. Recall from equation (2.11) that y_j = f(v_j) is our activation function, and we calculated its derivative in equation (2.13). That gives us [33]:

\frac{\partial y_j}{\partial v_j} = y_j (1 - y_j).    (2.19)

Next, we look at the first partial derivative in the error term, \partial E / \partial y_j. We calculate the derivative of equation (2.14) with respect to y_j, which is [33]:

\frac{\partial E}{\partial y_j} = -(t_j - y_j).    (2.20)

Now, putting together equation (2.16), we have [33]:

\frac{\partial E}{\partial w_{kj}} = -(t_j - y_j)\, y_j (1 - y_j)\, y_k.    (2.21)

In the case that the j-th layer is a hidden layer, calculating the partial derivative \partial E / \partial y_j is not as simple. We must check how the error from the j-th node's activation y_j propagates to the activations of the i-th layer. In mathematical form, this yields [33]:

\frac{\partial E}{\partial y_j} = \sum_{i \in I_j} \frac{\partial E}{\partial y_i} \frac{\partial y_i}{\partial v_i} \frac{\partial v_i}{\partial y_j},    (2.22)

where I_j refers to the set of nodes i in the next layer that are connected to node j. The first two factors in the sum are, according to equation (2.18), equal to -\delta_i. The last factor is the derivative of equation (2.10) with respect to the activation y_j, which gives [33]:

\frac{\partial v_i}{\partial y_j} = w_{ji}.    (2.23)

And substituting them results in the following equation [33]:

\frac{\partial E}{\partial y_j} = -\sum_{i \in I_j} (\delta_i w_{ji}).    (2.24)

This then gives us the equation for calculating the weight gradient in the hidden layer as [33]:

\frac{\partial E}{\partial w_{kj}} = -\sum_{i \in I_j} (\delta_i w_{ji})\, y_j (1 - y_j)\, y_k.    (2.25)

Therefore, the error terms for the hidden and output layers look like [33]:

\delta_j = \begin{cases} (t_j - y_j)\, y_j (1 - y_j) & \text{if } j \text{ is an output node} \\ \big( \sum_{i \in I_j} \delta_i w_{ji} \big)\, y_j (1 - y_j) & \text{if } j \text{ is a hidden node.} \end{cases}    (2.26)

We can then combine the hidden and output layer cases into a single equation by using the error term \delta_j we defined in equation (2.18), resulting in [33]:

\frac{\partial E}{\partial w_{kj}} = -\delta_j y_k.    (2.27)
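To make the derivation concrete, the sketch below trains a tiny one-hidden-layer network on a single made-up example using the error terms of equation (2.26) and the gradient of equation (2.27). It is an illustration only; the sizes, data and the omission of biases are assumptions, not details from the thesis:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
D, H, K = 4, 5, 3                          # input, hidden and output sizes (made up)
W1 = rng.normal(scale=0.5, size=(H, D))    # biases omitted to keep the sketch short
W2 = rng.normal(scale=0.5, size=(K, H))

x = rng.normal(size=D)                     # a single training example
t = np.array([1.0, 0.0, 0.0])              # its target output
lr = 0.5

for _ in range(200):
    # Forward pass: eqs. (2.10)-(2.11) for each layer.
    y1 = sigmoid(W1 @ x)
    y2 = sigmoid(W2 @ y1)

    # Error terms of eq. (2.26) for the output and hidden layers.
    delta_out = (t - y2) * y2 * (1.0 - y2)
    delta_hid = (W2.T @ delta_out) * y1 * (1.0 - y1)

    # Eq. (2.27): dE/dw_kj = -delta_j * y_k, so gradient descent
    # adds lr * delta_j * y_k to each weight.
    W2 += lr * np.outer(delta_out, y1)
    W1 += lr * np.outer(delta_hid, x)

print("final error:", 0.5 * np.sum((t - sigmoid(W2 @ sigmoid(W1 @ x))) ** 2))
```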

2.4.4 Modern optimization techniques

The original gradient descent algorithm must go through the entire dataset to perform one iteration of updates to the parameters of the neural network, which can be slow [29]. The stochastic gradient descent (SGD) algorithm is an improvement over the original gradient descent algorithm. In SGD, the neural network randomly picks one sample per iteration and updates the weights using that. If the dataset is very large, it can still be slow to iterate through each sample in the dataset, so SGD can also accept a mini-batch and average over it. Often with SGD, the neural network can be slow to


converge, as it can get stuck in a local minimum of the loss function. Momentum [34] helps to push the weights in a consistent direction, giving the updates "momentum", which helps speed up convergence and find a better minimum of the loss function.

There are other, more recent approaches to optimize the training of neural networks. One method is to use an adaptive learning rate which adjusts itself during the training process.

One algorithm that does this is called AdaGrad [35]. It scales the learning rate by dividing it by the square root of the sum of the current and past squared gradients, so parameters with larger gradients receive a smaller effective learning rate and vice versa. The RMSProp [36] algorithm is an improvement of AdaGrad which takes an exponential moving average instead of the cumulative sum of the squared gradients, and it has been shown to give better results than AdaGrad. The AdaDelta [37] algorithm is another improvement on AdaGrad. It replaces the learning rate with the ratio of the root mean square of the updates and the root mean square of the gradients, so it only requires a learning rate at the beginning and the subsequent updates are made without it. It has also shown improved performance over other optimizers. And lastly, the Adam algorithm [38] combines the benefits of momentum and RMSProp: it adapts the learning rate of each parameter by dividing the learning rate by the square root of an exponential moving average of the squared gradients, while also keeping an exponential moving average of the gradients themselves.
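To show how these optimizers differ from the plain update of equation (2.9), here is a rough sketch of single update steps for momentum and Adam. The hyperparameter values are common defaults assumed for illustration, not values used in the thesis:

```python
import numpy as np

def momentum_step(theta, grad, velocity, lr=0.01, beta=0.9):
    """Momentum: accumulate a velocity and move the parameters along it."""
    velocity = beta * velocity - lr * grad
    return theta + velocity, velocity

def adam_step(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """Adam: exponential moving averages of the gradient (m) and its square (v)."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)              # bias correction for the early steps
    v_hat = v / (1 - b2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Toy usage: minimize f(theta) = theta^2 with Adam.
theta, m, v = np.array([5.0]), np.zeros(1), np.zeros(1)
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2.0 * theta, m, v, t, lr=0.05)
print(theta)   # close to the minimum at 0
```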

2.4.5 Overfitting and Regularization

DNNs are trained to predict and classify unseen data that is different from the data they are trained on. When the neural network performs well on unseen data, it is said to have good generalization capability [1]. Typically, we have a separate training split of our dataset and only that is used to train the neural network. When the neural network learns to optimize its parameters and minimizes the loss function, it is reducing its training error. But in reality, the neural network should also have a low generalization error [1]. The generalization error refers to the error in classifying or predicting a new input [1], and it can be estimated with a testing set. If the training error is low but the neural network is not able to generalize well on unseen data, it is said to overfit [1]. A neural network that memorizes the training data is not useful, as the unseen data can contain noise and the neural network should be able to predict such data equally well. One way to fight overfitting is to use a validation set. It helps to tune the hyperparameters, e.g.

the number of hidden neurons in the neural network [1]. It is typically a small split of the training set. While the training set is used to learn the parameters, the validation set is used to estimate the generalization error during and after training so the learning can be adjusted.

Regularization [1] refers to approaches that help in reducing the generalization error without affecting the training error. One method to implement regularization is to introduce a regularizer term, i.e. a penalty, inside the loss function to limit the capacity of the model [1]. Weight decay [1] is one type of L1 or L2 parameter norm penalty that can be used as a regularizer. Noise injection [1] is another regularization method which makes the neural network more robust by injecting noise into the training set. Batch normalization [39] is a technique used to normalize the inputs of each layer to prevent them from having to adapt to new distributions during training. And finally, dropout [40] is a computationally cheap and effective technique to prevent overfitting.

It randomly picks neurons and "drops" them by setting their activations to zero, forcing other neurons to learn more about the data instead of just memorizing it.
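A minimal sketch of the dropout idea follows (inverted dropout, which rescales the surviving activations; this is an illustration, not necessarily the exact scheme used in the thesis experiments):

```python
import numpy as np

def dropout(activations, p_drop=0.5, training=True, rng=None):
    """Randomly zero out a fraction p_drop of the activations during training and
    rescale the rest so their expected value is unchanged; do nothing at test time."""
    if not training or p_drop == 0.0:
        return activations
    rng = rng or np.random.default_rng()
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)

h = np.ones(10)
print(dropout(h, p_drop=0.5))   # roughly half zeros, the rest scaled to 2.0
```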

2.5 Convolutional neural networks

Up until the early 2000s, most pattern-recognition or classification systems required expertise and careful engineering to design feature extractors which transformed raw data into some suitable representation from which a classifier could learn to detect or classify patterns. With the rise of deep learning, feature representations could be discovered automatically by the neural networks from raw data [41]. One type of DNN that is able to automatically learn feature representations is the convolutional neural network (CNN) [42]. CNNs were designed specifically for image recognition problems [42]. CNNs are able to process data which has a grid-like topology, and this makes them effective for analyzing visual imagery [1]. Traditional MLPs did have some success in image classification tasks, but they were not efficient for the task [42]. One of the issues was that the images used for training image classifiers can be very large. A fully connected layer with many neurons (e.g. 1000+) could easily end up with several thousands of weights just in the first layer. With this many weights, there is already a possibility of overfitting, especially if the number of samples in the dataset is scarce. Using ordinary fully connected networks is computationally expensive due to the number of computations that need to be performed, and thus hardware limitations also played a role in the development of CNNs [42].

Another factor contributing to the development of CNNs is that regular MLPs ignore the structure of the input data. Typically, images and speech have strong 2D local structure.

These local correlations can be taken advantage of when features are extracted to recognize spatial or temporal objects. CNNs restrict the receptive fields of neurons in the hidden layer to be local during feature extraction [42].

Figure 2.4 illustrates the architecture of a CNN. The CNN architecture contains two


Figure 2.4: Convolutional neural networks typically consist of convolutional, pooling and fully connected layers. Image based on [29].

parts. The first part consists of convolutional and pooling layers, which are responsible for detecting the features in the input. The second part consists of a fully connected layer connected to the output layer, which produces the classification of the input.

2.5.1 Convolution layer

A convolution in mathematical terms is an operation on two functions producing a third function [1]. It is a finite impulse response (FIR) filtering operation whose kernel is the filter that is applied to the signal. In CNNs, convolutions are performed when the filter (also known as the kernel) is moved over the input, computing dot products.

When a CNN receives an image as input, its width and height are measured by the number of pixels in those dimensions. An input can also have a depth of 3 layers, one for each color in the RGB spectrum. These depth layers are known as channels.

The filter goes through the image one step at a time to find patterns in the image. It is a square matrix which is typically much smaller than the image itself. The step size is known as the stride. As the filter goes through the image, a dot product is taken of the filter and the patch of the image it is on. The result of each dot product is then stored in a third matrix called the feature map or activation map. The feature map is then passed through a non-linear activation function such as tanh or ReLU. When defining the convolution layer, we can choose the number of filters we need to extract features.

We can also tune the size of these filters, how many strides they should move during the convolution process and how much padding they need to have. Figure 2.5 shows a diagram of the convolution process.

The convolution operation is denoted with an asterisk in mathematical terms. It is given by [1]:

s(t) = (x * w)(t).    (2.28)

In CNNs, the convolution is used in its discrete form, i.e. if the data is given in intervals


Figure 2.5: Convolution process for a 2x2 convolution filter: h[i, j] = A*P1 + B*P2 + C*P3 + D*P4. The output of the convolution process is called a feature map.

of time t, we can define a discrete convolution as [1]:

s(t) = (x * w)(t) = \sum_{a=-\infty}^{\infty} x(a)\, w(t - a).    (2.29)

But images are normally in the form of multi-dimensional arrays because they have height and width. Because the convolution is performed for more than one axis, the convolution process for two-dimensional data is given as [1]:

S(i, j) = (I * K)(i, j) = \sum_{m} \sum_{n} I(i - m, j - n)\, K(m, n).    (2.30)
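The sliding-window operation described above can be written directly as nested loops. The sketch below is illustrative, with a made-up filter; it computes the dot product of the filter and each image patch. Note that equation (2.30) additionally flips the kernel, whereas CNN layers in practice usually apply the unflipped version (cross-correlation):

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Slide the kernel over the image and take a dot product at each position
    ('valid' padding, so the feature map is smaller than the input)."""
    H, W = image.shape
    kH, kW = kernel.shape
    out_h = (H - kH) // stride + 1
    out_w = (W - kW) // stride + 1
    fmap = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kH, j * stride:j * stride + kW]
            fmap[i, j] = np.sum(patch * kernel)   # dot product of filter and patch
    return fmap

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])      # a hypothetical 2x2 filter
print(conv2d(image, kernel))                      # a 3x3 feature map
```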

2.5.2 Pooling Layer

It is also common to add a pooling layer which reduces the dimensions of the feature maps further to reduce the number of parameters and computations in the network.

This helps to shorten the training time and prevent overfitting. One common non-linear function to implement pooling is max pooling. It splits the image into non-overlapping rectangles and then outputs the maximum of each of them. This forms a lower-dimensional space containing only the locations of the image that show the strongest presence of the features. Figure 2.6 shows an example of how max pooling is performed.
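Max pooling can be sketched in the same style as the convolution example above. The window contents here are made up for illustration rather than copied from Figure 2.6:

```python
import numpy as np

def max_pool2d(fmap, size=2, stride=2):
    """Split the feature map into size x size windows and keep the maximum of each."""
    H, W = fmap.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    pooled = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = fmap[i * stride:i * stride + size, j * stride:j * stride + size]
            pooled[i, j] = window.max()
    return pooled

fmap = np.array([[3., 6., 3., 5.],
                 [2., 8., 1., 7.],
                 [8., 1., 2., 5.],
                 [2., 7., 5., 3.]])
print(max_pool2d(fmap))   # the bottom-left window gives max(8, 1, 2, 7) = 8
```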

2.5.3 Properties of CNNs

CNNs have three main advantages that make them great for image recognition tasks.

The first advantage is that they have sparse interactions [1] between units in the network. This is because the filter is typically much smaller than the input: if a picture has 200x200 dimensions, filters as small as 3x3 can detect the features in the input.


Figure 2.6: Diagram showing max pooling of size 2x2 with stride 2 on a feature map, e.g. max(8, 1, 2, 7) = 8 for one of the windows.

Thus, fewer parameters are required, resulting in fewer computations. The second advantage they have is parameter sharing [1]. When a filter slides across the input to detect features, the same parameters are used in the function computing the outputs. This makes CNNs efficient, as the process requires little memory. And the final advantage of using CNNs is equivariant representation [1]. Layers in the CNN have the equivariance property, which means that if the input to a function changes, the output changes in the same way. In the convolution context, when an object in the input is moved, its representation moves the same amount. This is useful when we know that a function applied to a small number of pixels can be applied elsewhere. As similar features appear in various parts of the image, it is practical for them to share parameters [1].

2.6 Autoencoders

An autoencoder is a special type of neural network that can be used to learn a lower-dimensional code representation from data without labels [1]. Autoencoders are typically trained to reproduce the input as their output. There are two parts to the autoencoder: an encoder part, which takes the input and reduces it to a code representation of the most important aspects of the input, and a decoder part, which takes the code representation and re-generates the input. It is purposely designed not to re-generate the input perfectly, because its purpose is only to learn and store the most important features of the input [1]. Autoencoders are data-specific, which means that they can only encode and decode data similar to the training data provided to them [1].

If an autoencoder is trained to compress images of dogs, it will do poorly on images of cats, as it has been trained to learn dog-like features. Autoencoders are also able to learn automatically from the training data. They do not require any feature engineering, as the learning algorithm automatically picks up and learns the useful features of the input.


Figure 2.7: The typical structure of an autoencoder. It consists of an encoder, which learns useful features (the code) from an input, and a decoder, which re-creates the input from the learned code representation.

2.6.1 Undercomplete Autoencoders

When it comes to autoencoders, we are typically not interested in the output of the decoder. The reason is that autoencoders are generally used to learn the most important and useful features of the inputs. To achieve this, the code representation must have a smaller dimension than the input. This restricts the autoencoder to learn and focus on only the most useful features. An autoencoder whose encoder produces a smaller code representation than the input is called an undercomplete autoencoder [1]. The learning process is described as minimizing a loss function such as MSE:

E_{MSE} = \frac{1}{2}(t - y)^2.    (2.31)

If we use MSE and the decoder is linear, the autoencoder ends up learning the same principal subspace as PCA, in addition to learning features of the input [1].
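As a concrete illustration of an undercomplete autoencoder, here is a short PyTorch sketch with made-up layer sizes and random stand-in data. These choices are assumptions of this example, not the architecture used later in the thesis:

```python
import torch
from torch import nn

input_dim, code_dim = 784, 32    # hypothetical sizes, e.g. flattened 28x28 inputs

encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(),
                        nn.Linear(128, code_dim))
decoder = nn.Sequential(nn.Linear(code_dim, 128), nn.ReLU(),
                        nn.Linear(128, input_dim), nn.Sigmoid())

optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()),
                             lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.rand(64, input_dim)            # stand-in batch; real data would go here
for _ in range(10):
    x_hat = decoder(encoder(x))          # bottleneck code, then reconstruction
    loss = loss_fn(x_hat, x)             # reconstruction error as in eq. (2.31)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
print(loss.item())
```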

2.6.2 Regularized Autoencoders

When the code dimension is the same as or larger than the input dimension, the autoencoder can fail to learn anything useful and instead just learns to copy the input. Ideally, the size of the code dimension is determined by how complex the distribution is that needs to be modeled by the autoencoder [1]. While making the code dimension smaller does help, another method is to use loss functions that encourage the autoencoder to learn other properties from the input. Some examples of properties it could learn are [1]: sparsity


of the representation, the size of the derivatives of the representation and handling noisy or missing inputs. These regularized autoencoders can learn useful properties even if the autoencoder is given the capacity to perfectly learn the input.

Sparse Autoencoders

A sparse autoencoder [1] is an autoencoder whose training includes a sparsity penalty Ω(h), where h is the code layer. Sparse autoencoders are typically used to learn features for another task, e.g. a classification task [1]. When an autoencoder is trained with a sparsity penalty, it must learn unique statistical information about the input data rather than simply learn how to reproduce the input. The code representation layer in sparse autoencoders is typically bigger than the input and output layers. The loss function with the penalty is given as:

E_{MSE} = \frac{1}{2}(t - y)^2 + \Omega(h).    (2.32)

Sparse autoencoders are able to respond to unique statistical features of the input data they have been trained on, besides learning other useful properties [1].

Denoising Autoencoders

Denoising autoencoders [1] are autoencoders which learn useful information about an input by learning to reproduce the input from a corrupted version of it. In a denoising autoencoder, the training involves minimizing the loss between the output and the original, uncorrupted input, given the corrupted input. Applications of denoising autoencoders include removing noise or undoing corruption in the inputs.


3. Speech Recognition

Speech, on a basic level, is human vocal communication using a language. It is used to convey a message to the listener, and it contains many levels of information. A speech signal may contain information about gender, emotion, the language being spoken and attributes which identify the speaker. Speech technologies have two principal tasks. The first is speech recognition, which is recognizing the speech content, i.e. the speech-to-text task. The second is turning text into speech, known as text-to-speech (TTS) or speech synthesis. Figure 3.1 shows an overview of how speech is transcribed by a speech recognition system.

Figure 3.1: General overview of how speech is broken down into phones and then transcribed by a speech recognition system [43].

Speech recognition dates back to the 1950s, when researchers at Bell Laboratories created the first "Digits recognizer" [44]. It was able to recognize digits by learning the formant frequencies of the vowels in the spoken digits. The next big breakthrough occurred when researchers at IBM labs created the "Shoebox" in the early 1960s [45]. It was able


to recognize digits and other arithmetic terms such as "plus", "minus" and "total" to perform basic arithmetic operations, taking only speech from a microphone as input. The dynamic time warping (DTW) [46] algorithm was developed in the following years and was used to create a classifier that could recognize words from a vocabulary of 200 words.

In the late 1960s, the mathematical concepts of Markov chains led to the use of hidden Markov models (HMM) [47, 48] in speech recognition. In the 1980s, the field of speech recognition shifted towards statistical modeling frameworks [49]. HMMs quickly became mainstream in speech recognition tasks and replaced the DTW algorithm.

HMMs continued improving and remained the standard models for speech recognition tasks for the next two decades. Around the same time, the n-gram model was first used in speech recognition [50]. The n-gram model is a statistical model which predicts the occurrence of a sequence of n words. It was purely a convenient and powerful statistical representation of grammar. Since its introduction, it has been commonly used in large vocabulary speech recognition systems [49].

Figure 3.2: Historical timeline of speech recognition technology before the DNN era [49].

HMMs combined with feedforward neural networks continued to dominate speech recognition systems well into the early 2000s. ANNs were explored early on as well, but the traditional models performed better. With the availability of larger datasets and more computing power in the late 2000s, deep learning came into the picture, allowing ANNs to make a comeback and surpass traditional models [2].


3.1 Basic speech recognition pipeline

Speech recognition is the task of converting continuous speech into a sequence of words [51]. It is essentially the task of taking an acoustic signal (speech) which contains a spoken language and mapping it to the words spoken [1]. Let X = (x^{(1)}, x^{(2)}, \ldots, x^{(T)}) be the sequence of acoustic vectors computed from the speech and y = (y_1, y_2, \ldots, y_N) the sequence of words it represents. The task of automatic speech recognition (ASR) is then given as [1]:

f_{ASR}(X) = \arg\max_{y} P^*(y \mid X = X),    (3.1)

where P^* is the true conditional distribution relating the input X to the output y [1].

Figure 3.3: Typical speech recognition system pipeline: input speech goes through feature extraction and a decoder, which uses an acoustic model and a language model to produce the output text.

In traditional speech recognition systems, feature extraction methods were hand-designed to train such models, but with deep learning, DNNs can automatically extract features from speech. To train a speech recognition system, the raw audio data cannot be used directly and must be converted to a format with which the neural networks can work. Thus, the raw audio files must be converted to fixed-size vectors. In machine learning, this process of converting discrete objects to vectors of real numbers is called embedding. Embedding is performed on various kinds of data, such as text, images and, in our case, speech, to convert the data to vectors of numbers which can then be used to train our neural networks.

When extracting features from speech utterances, the audio sample is grouped into short segments called frames. Frames are typically 25 milliseconds long, with a time shift between each overlapping frame which is typically 10 milliseconds. Then the


strengths of the frequencies are calculated across a set of bands. This calculation is called the short-time Fourier transform. Each set of spectral magnitude values can be treated as a vector of numbers, and these vectors are arranged in time order to form a 2D array. This 2D array can then be treated like a single-channel image and is known as a spectrogram [52]. Modern speech recognition tasks use short-term feature extraction methods such as mel-frequency cepstral coefficients (MFCCs) [53] and perceptual linear prediction coefficients (PLPs) [54]. Training and testing both share this feature extraction phase.
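As an example of this feature extraction step, the sketch below computes a log mel-spectrogram with the librosa library, using the 25 ms / 10 ms framing mentioned above. The file name and parameter values are assumptions for illustration, not the exact configuration used in the thesis experiments:

```python
import librosa

# Hypothetical clip; speech commands utterances are roughly 1 s at 16 kHz.
audio, sr = librosa.load("yes.wav", sr=16000)

n_fft = int(0.025 * sr)         # 25 ms frames  -> 400 samples
hop_length = int(0.010 * sr)    # 10 ms shift   -> 160 samples

mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_fft=n_fft,
                                     hop_length=hop_length, n_mels=40)
log_mel = librosa.power_to_db(mel)   # log-compressed mel-spectrogram "image"
print(log_mel.shape)                 # (n_mels, n_frames)
```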

After features have been extracted from the input, the decoder in a speech recognition system finds the words in the acoustic and language models that best match the given input sequence [51]. The decoder consists of two main parts. The first part is the acoustic model. The acoustic model takes care of establishing the statistical representations of the feature vectors computed from the speech input [51]. It is responsible for finding the sequence of words that matches the feature vectors. As mentioned above, HMMs were one of the most common types of acoustic models used in speech recognition tasks before the advent of DNNs. The second part of the decoder is the language model. The language model is responsible for checking whether the sequence of words complies with the grammatical structure of the language being spoken. With the availability of large text corpora, formal grammar can now be generalized with accurate probabilities [51]. The performance of speech recognition systems is measured by the word error rate (WER). WER can be calculated by

E_{WER} = \frac{S + I + D}{N},    (3.2)

where S stands for the number of substitutions, I for insertions, D for deletions and N for the number of words in the reference.
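WER is usually computed by aligning the hypothesis to the reference with an edit-distance (Levenshtein) alignment, which counts the substitutions, insertions and deletions of equation (3.2). A small self-contained sketch (not code from the thesis):

```python
def word_error_rate(reference, hypothesis):
    """WER = (S + I + D) / N via a standard edit-distance alignment over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum number of edits to turn ref[:i] into hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("turn the lights on", "turn lights off"))  # 2 edits / 4 words = 0.5
```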

3.2 Hidden Markov model

Figure 3.4: Markov chain process with two states ("Nap" and "Pizza") and their transition probabilities.

HMMs are a statistics-based approach to speech recognition tasks which outputs a


sequence of symbols or quantities. The concepts behind HMM-based speech recognition technologies were developed in the 1970s by groups at CMU, IBM and Bell labs [55].

HMMs are based on Markov chains. A Markov chain is finite and discrete in nature.

Each discrete value in the chain is associated with a state [56]:

q_t \in \{ s^{(j)},\; j = 1, 2, \ldots, N \}.    (3.3)

A Markov chain can be described by its transition probabilities [56]:

P(q_t = s^{(j)} \mid q_{t-1} = s^{(i)}) = a_{ij}(t), \quad i, j = 1, 2, \ldots, N.    (3.4)

If the transition probabilities are independent of the time t, we get what is called a homogeneous Markov chain, which has the following state transition properties [56]:

A = [a_{ij}], \quad \text{where } a_{ij} \ge 0 \text{ and } \sum_{j=1}^{N} a_{ij} = 1.    (3.5)

If the output has a one-to-one correspondence to a state, then the Markov chain is called an observable Markov sequence [56]. This means that each state corresponds to an event. Because speech features are typically highly variable, the lack of randomness in Markov chains makes them too restrictive. Thus, we must introduce some randomness into Markov chains so that the states can overlap; this is called a hidden Markov sequence [56]. The randomness is introduced by assigning an observation probability distribution to each state. This makes the Markov sequence a random sequence whose Markov chain cannot be directly observed. Instead, the underlying Markov chain is observed through a separate random function that is characterized by the probability distribution of the observations [56]. When we use hidden Markov sequences to model a real system or source, they are referred to as hidden Markov models [56].

The list below describes what HMMs consist of [56]:

1. HMMs have the transition probabilities, A = \{a_{ij}\}, i, j = 1, 2, \ldots, N, of a homogeneous Markov chain with N states, given as [56]:

a_{ij} = P(q_t = j \mid q_{t-1} = i), \quad i, j = 1, 2, \ldots, N.    (3.6)

2. The initial state distribution probabilities of an HMM are given as [56]:

\pi = [\pi_i], \; i = 1, 2, \ldots, N, \quad \text{where } \pi_i = P(q_1 = i).    (3.7)

3. HMMs also have an observation probability distribution, P(o_t \mid s^{(i)}), i = 1, 2, \ldots, N. If o_t is discrete, the distribution of each state gives the probabilities of the symbolic observations \{v_1, v_2, \ldots, v_K\} [56]:

b_i(k) = P[o_t = v_k \mid q_t = i], \quad i = 1, 2, \ldots, N.    (3.8)

To generate an observation sequence o_1^T = o_1, o_2, \ldots, o_T with an HMM, we follow this procedure [57]:

1. Pick an initial state q_1 = s_i based on the initial state distribution \pi.
2. Set t to 1.
3. Choose an observation o_t = v_k based on the observation distribution of state s_i.
4. Transition to a new state q_{t+1} = s_j according to the transition probabilities a_{ij}.
5. Increment t by 1 and repeat from step 3 if t < T; otherwise terminate the procedure.

This procedure generates observations, and it can be used as a model to generate a given observation sequence [57]. In terms of ASR, HMMs are used for two main tasks: the first is to train the model, and the second is to decode text given a speech utterance and a trained model.
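The generation procedure above is easy to simulate for a small discrete HMM. In the sketch below, all probabilities are made-up toy values; they are not taken from the thesis or from any speech model:

```python
import numpy as np

rng = np.random.default_rng(0)

pi = np.array([0.6, 0.4])          # initial state distribution, eq. (3.7)
A = np.array([[0.7, 0.3],          # transition probabilities, eq. (3.6)
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],     # observation probabilities, eq. (3.8)
              [0.1, 0.3, 0.6]])    # 2 states, 3 discrete observation symbols

def sample_hmm(T):
    """Pick a start state, then alternately emit an observation and transition."""
    states, observations = [], []
    q = rng.choice(len(pi), p=pi)                              # step 1
    for _ in range(T):                                         # steps 2-5
        observations.append(rng.choice(B.shape[1], p=B[q]))    # step 3: emit
        states.append(q)
        q = rng.choice(len(pi), p=A[q])                        # step 4: transition
    return states, observations

print(sample_hmm(T=10))
```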

3.3 Architecture of HMM-based recognizer

In an acoustic model, the most basic unit of speech is the phone [55]. When speech is recognized, these phones are put together to form the words that represent the speech utterance.

When features are extracted from speech, the output is a sequence of fixed-size acoustic vectors Y = y_1, \ldots, y_T. The recognizer's decoder finds the sequence of words W = w_1, \ldots, w_K that most likely generated Y. This is given by the equation [55]:

\hat{W} = \arg\max_{W} \big[ p(Y \mid W)\, p(W) \big],    (3.9)

where the acoustic model and the language model calculate the probabilities p(Y|W) and p(W), respectively [55].


When the words are broken down into phones, we get a sequence of sounds. These are called base phones [55]. To allow for variations in pronunciation, we calculate the probability p(Y|W) over multiple pronunciations by [55]:

p(Y \mid W) = \sum_{Q} p(Y \mid Q)\, p(Q \mid W),    (3.10)

where Q represents the potential pronunciations Q_1, \ldots, Q_K, and each one is itself a sequence of base phones Q_k = q_1^{(k)}, q_2^{(k)}, \ldots [55]. We can then calculate

p(Q \mid W) = \prod_{k=1}^{K} p(Q_k \mid w_k),    (3.11)

where p(Q_k \mid w_k) gives us the probability that word w_k is pronounced as the base phone sequence Q_k [55].

Each base phone q is represented by a continuous density HMM. Each q has transition parameters and output observation distributions represented by \{a_{ij}\} and \{b_j(\cdot)\}, respectively. These are typically given as mixtures of Gaussians [55]:

b_j(y) = \sum_{m=1}^{M} c_{jm}\, \mathcal{N}(y; \mu_{jm}, \Sigma_{jm}),    (3.12)

where \mathcal{N} represents a normal distribution with \mu_{jm} and \Sigma_{jm} as mean and covariance, respectively, and M represents the number of Gaussian components [55]. Because the dimension of the acoustic vectors y computed from speech can be high, the covariance is often set to be diagonal. When the HMM-based phone models are made, the entry and exit states are non-emitting, which makes it simpler to form words when phone models are combined [55].

When all the phone models are combined, they form a composite HMM Q with which the acoustic likelihood can be computed by [55]:

p(Y \mid Q) = \sum_{X} p(X, Y \mid Q).    (3.13)

Each composite HMM has a state sequence X going through it, and the likelihood p(X, Y \mid Q) is given by [58]:

p(X, Y \mid Q) = a_{x(0), x(1)} \prod_{t=1}^{T} b_{x(t)}(y_t)\, a_{x(t), x(t+1)}.    (3.14)


With a corpus of speech utterances, we can estimate the parameters of the acoustic model using the expectation maximization (EM) [59] algorithm. The EM algorithm is a method for calculating the maximum likelihood estimate in cases where we have hidden variables [56].

In the special case where the hidden variables form Markov chains, the EM algorithm is called the Baum-Welch algorithm [56]. It has two stages: the E step and the M step. In the E step, it computes the state occupation probabilities with a technique called forward-backward alignment. In the M step, it estimates the means and covariances using simple weighted averages [55].

Given a speech utterance and a trained model, the decoding process involves searching through every possible state sequence. These state sequences are made up of word sequences that could have generated the input features Y [55]. Dynamic programming is required to solve this task efficiently. If we have a sequence y_1, \ldots, y_t, the likelihood of observing this sequence will be [55]:

\phi_j(t) = \max_{x} \{ p(y_1, \ldots, y_t, x(t) = j \mid M) \},    (3.15)

where j represents the given state at time t given the model M. We can use the Viterbi algorithm to calculate this probability recursively [55]:

\phi_j(t) = \max_{i} \{ \phi_i(t-1)\, a_{ij} \}\, b_j(y_t).    (3.16)

We initialize the algorithm by setting \phi_j(0) to 1 for the initial state and 0 for all other states. The probability of the best word sequence is then obtained from \max_j \{ \phi_j(T) \}, which gives us the best matching word sequence representing the speech utterance.
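A compact implementation of this recursion is sketched below. It uses an initial state distribution π instead of a single designated entry state, which is a common equivalent formulation rather than the exact initialization described above; the model values are toy numbers:

```python
import numpy as np

def viterbi(pi, A, B, observations):
    """Recursion of eq. (3.16): phi_j(t) = max_i { phi_i(t-1) a_ij } b_j(o_t),
    with back-pointers so the best state sequence can be read off at the end."""
    N, T = len(pi), len(observations)
    phi = np.zeros((T, N))
    back = np.zeros((T, N), dtype=int)
    phi[0] = pi * B[:, observations[0]]                 # initialization
    for t in range(1, T):
        for j in range(N):
            scores = phi[t - 1] * A[:, j]
            back[t, j] = np.argmax(scores)
            phi[t, j] = scores[back[t, j]] * B[j, observations[t]]
    # Backtrack from the best final state, max_j { phi_j(T) }.
    path = [int(np.argmax(phi[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return list(reversed(path))

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(viterbi(pi, A, B, observations=[0, 1, 2, 2]))     # most likely state sequence
```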

3.4 Isolated word recognition

In this thesis, we aim to generate speech from a model trained to classify speech utterances. Specifically, we want to train a speech classifier to learn and classify speech commands, and then generate speech from that classifier to "listen to" which acoustic features the classifier has picked up on when learning the speech commands.

The speech commands corpus that we want to use is made up of isolated spoken words.

Thus, we will be creating a simple DNN-based word recognition system. Since our task involves classifying isolated words, we do not need to use HMMs or other models designed to predict sequences. Instead, we will be using CNNs to create our speech classifier. Figure 3.5 shows an overview of the model we intend to use for our


task.

Figure 3.5: Overview of the proposed CNN model for the thesis task: an input utterance (e.g. "Nope") is converted to a mel-spectrogram and passed through convolutional, pooling and fully connected layers to produce the output class.

Although CNNs were specifically designed for image recognition problems, speech signals have properties that make them suitable for this type of network too. Because speech is also a type of time series data, CNNs can be used to train speech recognition systems, and they have been shown to be effective at speech tasks [3, 4, 42]. But we cannot directly use raw audio as our features; we must convert the one-dimensional continuous signal across time into a 2D spatial problem, which CNNs are good at. As explained in the basic pipeline of speech recognition, this can be done by defining a window of time in which the speech utterances should fit and converting that window into an image, which we described above as a spectrogram.

Because the inputs to a CNN must be fixed-size vectors, we have to do some pre-processing on our speech data to remove any time variance, as some speech utterances in the corpus are not exactly 1 second long. The features are adjusted by zero padding if they are smaller than the required feature vector dimension. As we do not attempt to predict sequences of words from speech, we will not be using the WER to evaluate the performance of our speech classifier. Instead, we will measure the performance by calculating the testing accuracy of the model, giving it unseen inputs from the test set and checking whether it predicts them correctly.
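To make the intended setup concrete, here is a rough PyTorch sketch of a small CNN word classifier over 40x101 log mel-spectrogram "images". The layer sizes, number of classes and input shape are illustrative assumptions, not the exact model described in Chapter 5:

```python
import torch
from torch import nn

num_classes = 30                                 # assumed number of command classes

model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 10 * 25, num_classes),        # 40x101 -> 20x50 -> 10x25 after pooling
)

x = torch.randn(8, 1, 40, 101)                   # a stand-in batch of spectrograms
logits = model(x)
labels = torch.randint(0, num_classes, (8,))     # stand-in labels
loss = nn.CrossEntropyLoss()(logits, labels)
print(logits.shape, loss.item())                 # (8, num_classes)
```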


4. Sampling from a trained classifier

In recent years, numerous studies have been done to open the black box of DNNs.

Researchers have attempted to generate samples from image classifiers by visualizing CNNs [20, 16, 21, 17]. One of the successful methods to visualize CNNs is called activation maximization. We want to investigate whether activation maximization can be used to “listen to” what a speech classifier has learned, akin to using activation maximization to “look at” what image classifiers have learned.

Nguyen et al. [22] proposed a new variation of activation maximization which has been successfully used to visualize DNNs. The proposed model uses a deep generator network (DGN) as a prior. It is trained to take in a code and output a synthetic image that looks close to the images used to train the DNN being visualized. To visualize the DNN, activation maximization is performed on the code until it produces an image which activates the target class in the DNN. Figure 4.1 shows an overview of the model proposed by Nguyen et al. [22].

Figure 4.1: To produce a preferred input for a target class from the classifier DNN, the code representation of the image generator DNN is optimized so that it outputs an image that activates the target class in the classifier DNN [22].


4.1 Activation maximization

Activation maximization is the task of finding input patterns which maximize the activation of a given unit [16]. This is itself an optimization problem. Let θ be the fixed neural network parameters, x the input of the neural network, and h_i(x; θ) the activation of neuron i. The whole neural architecture is then implicitly included in the function h_i using the fixed parameters θ. The input x^* that maximizes the activation h_i is then

x^* := \arg\max_{x} h_i(x; \theta).    (4.1)

Since the neural network is differentiable, we can apply gradient ascent to obtain a local maximum of h_i around some starting position x by repeating

x_{n+1} := x_n + \alpha \nabla_{x_n} h_i(x_n; \theta)    (4.2)

until the desired result is reached or after enough iterations (α is the learning rate). Note that, after the process has ended, we can take \hat{x} - x = x_{diff}, where x is the input before and \hat{x} is the result after performing activation maximization. So we can view the process as generating an additive "noise".
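The update of equation (4.2) is simple to express in an automatic differentiation framework. The following PyTorch sketch maximizes one output unit of an arbitrary differentiable classifier; the toy linear "classifier" at the bottom is a stand-in for illustration, not the speech classifier used in the experiments:

```python
import torch

def activation_maximization(classifier, x_init, target_unit, lr=0.1, steps=100):
    """Gradient ascent on the input: x_{n+1} = x_n + lr * grad_x h_i(x; theta)."""
    x = x_init.clone().detach().requires_grad_(True)
    for _ in range(steps):
        score = classifier(x)[0, target_unit]   # activation h_i(x; theta)
        score.backward()                        # gradient with respect to the input
        with torch.no_grad():
            x += lr * x.grad                    # eq. (4.2)
            x.grad.zero_()
    return x.detach()

# Toy usage with a random linear model over flattened 40x101 "spectrograms".
toy_classifier = torch.nn.Linear(40 * 101, 30)
x0 = torch.zeros(1, 40 * 101)
x_max = activation_maximization(toy_classifier, x0, target_unit=3)
print(x_max.shape)
```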

4.2 Generating classifier dependent noise

Traditionally, training a CNN uses the back-propagation technique to optimize and adjust the weights in each layer. In activation maximization, however, it is the input image that is optimized, until it is confidently classified as the target class being visualized [21]. Using the method of maximizing the activation of a neuron, we can maximize an output of a classification network to create an example that is maximally considered to be of one class. With image classification, we could try maximizing random noise towards the class "chair" and expect to obtain an image roughly representing a chair, for example.

These images could then be used to understand what the neural network is looking at when it interprets an image as a chair.

In practice, this alone does not produce the desired results [22]. Instead of generating images of chairs, the maximization procedure generates seemingly random spots on the image. While the images used in classification come from the distribution of natural images, the neural network can still classify any set of pixels into any class. The activation maximization technique then moves us away from the set of natural images into an unnatural region, where the image can contain arbitrary pixels.


Figure 4.2: Synthesized images which activate target classes in a trained image classifier [22].

4.3 Using deep generator network

When autoencoders are trained, the bottleneck layer learns the most useful features of the inputs, and the decoder can reconstruct an input from a code representation.

Although decoders can be used to generate new inputs, there are generative models specifically designed for this purpose. They are known as deep generator networks (DGN) [22]. DGNs use the principles of generative adversarial networks (GANs) [60] to train networks that can produce realistic inputs from random codes.

Figure 4.3: Generative adversarial network training [60]: a generator turns noise into fake images, a discriminator is trained on the training set and the fake images to tell real from fake, and both networks are fine-tuned against each other.

GANs are trained differently from autoencoders. A GAN consists of a generator and a discriminator. During training, the generator tries to produce an image that the discriminator cannot distinguish from a real image.

While trained GANs are able to generate realistic inputs from random codes, decoders given random codes do not produce realistic inputs, as they are designed to reconstruct inputs from learned code representations. However, inputs generated with the decoder do lie within the natural set of inputs, despite not being realistic.

Nguyen et al. [22] proposed a method which uses a decoder that is trained to generate natural images from a lower-dimensional latent code. Such a decoder can be obtained by, for instance, training an autoencoder on a dataset of natural images and retaining the decoder part. Nguyen et al. then use this decoder to turn a latent code into an image and apply activation maximization to the latent code. Since the decoder is only trained to create natural imagery, it limits the image to the set of natural images. With this method, the authors were able to create natural imagery of what the neural network has learned to see. As the images are more natural, they give researchers a better understanding of what a network has learned from its training.

In this thesis, we evaluate the use of activation maximization alone and activation maximization with such a prior in speech classification. We do this by evaluating generated samples with objective measures and perceptual evaluations.


5. Experimental Setup

5.1 Dataset

The speech dataset used in the experiments is the speech commands corpus (v0.02) [61].

The researchers behind the collection of this dataset argue that the most important units of recognition are usually words or short phrases, and so they aimed to create a speech commands corpus which can be used to train models that run on the device itself [61]. Most voice assistants listen for trigger phrases (wake-up words) or commands and then send the audio to a cloud-based service for processing. Having a system which always sends the audio directly to the cloud-based service is impractical and costly. It would be more practical if the device listened to the audio locally and only sent the audio to the cloud for processing when a trigger word or phrase is heard [61].

Another goal of the data collection effort was for the corpus to serve as a benchmark for manufacturers, who could demonstrate their accuracy and power efficiency when running the model on their devices, enabling them to optimize their chips and potentially create more efficient models for devices [61].

Table 5.1 shows a summary of the speech commands corpus. It contains twenty core words [61]: "Yes", "No", "Up", "Down", "Left", "Right", "On", "Off", "Stop", "Go", "Zero", "One", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight", and "Nine". There are other spoken words added to the corpus, such as [61]: "Bed", "Bird", "Cat", "Dog", "Happy", "House", "Marvin", "Sheila", "Tree", and "Wow". These words were added to make the classification more challenging, as some systems need to ignore speech which does not contain triggers and the model should be able to distinguish those. One

Table 5.1: Summary of the speech commands corpus.

              Training   Validation   Testing   Total
Utterances    84843      9981         11005     105829
Speakers      2112       256          250       2618
