

2.3 Audio signal representations

As already discussed, an audio signal consists of samples of measured instantaneous sound pressure values. This stream of samples is the digital time-domain representation of the captured sound. When dealing with sampled sound waves rather than a continuous signal, the details of the sound pressure wave falling between consecutive samples are lost. According to the Nyquist-Shannon sampling theorem, the idea of which was initiated in Nyquist (1928), frequency content of the audio signal above half the sampling frequency f_S cannot be distinguished from the frequency components 0...f_S/2. That is, the content of sound frequencies above f_S/2 will be aliased over the content at frequencies 0...f_S/2. Fortunately, real-life sounds tend to have their main frequency content at low frequencies, and the power of sound wave components usually decreases significantly towards high frequencies. Microphones also have a finite frequency response, naturally attenuating the highest frequencies.
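As a concrete illustration of aliasing, the following sketch (in Python with NumPy; the frequencies and sampling rate are illustrative choices, not from the original text) shows that a 7 kHz tone sampled at 8 kHz yields exactly the same samples as a 1 kHz tone:

```python
import numpy as np

fs = 8000                      # sampling frequency (Hz)
t = np.arange(fs) / fs         # one second of sample times

# A 7 kHz tone lies above fs/2 = 4 kHz, so it aliases onto 8000 - 7000 = 1000 Hz.
tone_7k = np.cos(2 * np.pi * 7000 * t)
tone_1k = np.cos(2 * np.pi * 1000 * t)

# The two sampled waveforms are indistinguishable.
print(np.allclose(tone_7k, tone_1k))   # True
```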

The above-mentioned fact that sound pressure waves are initiated by vibrating sound sources means that sound is also well represented in the frequency domain, which will be introduced below. Another fact mentioned above, that different sounds are invariably mixed together in real-life recordings, is the major problem in audio information retrieval tasks. Thus the mathematical basics of how sounds intertwine within a recorded signal are also considered below. Finally, the characteristics of human hearing and the corresponding audio signal analysis methods are considered. The presented methods have been found highly successful in many audio information retrieval tasks, and they have also been utilized in my publications.

Time-frequency representation of the audio signal

Since a sound wave is a consequence of the vibrations of sound sources, it is reasonable to represent the audio signal in terms of the wave frequencies that it contains. However, the frequency content of a sound usually changes over time, and thus a time-frequency domain representation of the audio signal is widely used in audio signal processing. In most audio signal processing frameworks it is implemented by cutting the signal into clips, i.e. frames, of equal length and estimating the frequency content of each frame x_n using the discrete Fourier transform (DFT) (Oppenheim et al. (1999)), which in this setting may also be called the short-time Fourier transform (STFT). The DFT coefficients X(k) for frequency bins k = 0...⌊(T_w − 1)/2⌋, T_w being the number of samples within the audio frame, are given by

X(k) = \sum_{t=0}^{T_w - 1} x(t) e^{-i 2\pi k t / T_w}.   (2.1)

The complex-valued DFT coefficient X(k) indicates the magnitude and the phase of the sound wave at frequency k f_S / T_w.
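A minimal sketch of evaluating (2.1) directly (assuming Python with NumPy; the frame content is arbitrary) and checking it against the FFT routine could look as follows:

```python
import numpy as np

def dft(frame):
    """Direct evaluation of Eq. (2.1) for bins k = 0 ... floor((Tw-1)/2)."""
    Tw = len(frame)
    t = np.arange(Tw)
    bins = np.arange((Tw - 1) // 2 + 1)
    return np.array([np.sum(frame * np.exp(-1j * 2 * np.pi * k * t / Tw))
                     for k in bins])

frame = np.random.randn(400)            # e.g. 25 ms of signal at 16 kHz
X = dft(frame)

# np.fft.rfft computes the same coefficients (plus bin Tw/2 for even Tw).
print(np.allclose(X, np.fft.rfft(frame)[:len(X)]))   # True
```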

For a smoother time-frequency domain representation, consecutive signal frames are usually taken with overlap, and the frequency analysis is focused on the middle part of the frame using a window function. A popular window function in audio signal processing is the Hann window (Harris (1978)), which weights the samples of the frame as

w(t) = 0.5 \left( 1 - \cos \frac{2\pi t}{T_w - 1} \right), \quad t = 0...T_w - 1.   (2.2)
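For reference, the window of (2.2) coincides with the Hann window provided by NumPy (a small sketch; the frame length is an arbitrary choice):

```python
import numpy as np

Tw = 400                                               # frame length in samples
t = np.arange(Tw)
hann = 0.5 * (1 - np.cos(2 * np.pi * t / (Tw - 1)))    # Eq. (2.2)

print(np.allclose(hann, np.hanning(Tw)))               # True
```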


Figure 2.1: Audio signal (below), with a spoken sentence: “Would you like some chocolate?”, and (above) a spectrogram representation of it as frame-wise STFT magnitudes.

The suitable length of the audio frame and the time shift between consecutive overlapping frames depend on the application. They are set considering the requirements of time resolution, frequency resolution and the latency limits of the application. For automatic speech recognition, coarse frequency information has been shown to be sufficient, while the time accuracy has to be good, as the pronunciation of the shortest phonemes lasts for only a few milliseconds and the average syllabic rate of speech is 4 Hz (Arnfield et al. (1995)). Thus, for ASR, a frame length of 10-30 ms and a frame shift of 5-10 ms are generally used. In the ASR framework of Publications II, I and V, a frame length of 25 ms and a shift of 5 ms are used. In music, the important aspects are the rhythm, melody and harmonic progression, where the shortest harmonic unities last about 100 ms. Thus, for detailed frequency content analysis of music, longer frames of up to 200 ms may be used. In the audio dereverberation task of Publication III, we found that a frame length of 50 ms and a shift of 25 ms gave the best results.

The time-frequency representation of the audio signal, also called a spectrogram and illustrated in Figure 2.1, involves a trade-off between the time and frequency resolution of the representation. The longer the signal frame, the more frequency bins may be estimated, but since the values are averages over the frame, the resolution in the time dimension becomes smoothed. On the other hand, the shorter the frame, the better the time resolution is preserved, but the estimation of the frequency content becomes coarser.
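Combining the framing, the Hann window of (2.2) and the DFT of (2.1), a spectrogram such as the one in Figure 2.1 could be sketched as below (assuming Python with NumPy; the 25 ms frame and 5 ms shift follow the ASR setting mentioned above, and the input signal is a random stand-in for a recording):

```python
import numpy as np

def spectrogram(x, fs, frame_ms=25, shift_ms=5):
    """Frame-wise STFT magnitudes: one row per frame, one column per frequency bin."""
    Tw = int(frame_ms * 1e-3 * fs)          # frame length in samples
    hop = int(shift_ms * 1e-3 * fs)         # shift between consecutive frames
    window = np.hanning(Tw)                 # Eq. (2.2)
    n_frames = 1 + (len(x) - Tw) // hop
    frames = np.stack([x[i * hop : i * hop + Tw] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))   # |X(k)| of Eq. (2.1) per frame

fs = 16000
x = np.random.randn(fs)                     # stand-in for a one-second recording
S = spectrogram(x, fs)
print(S.shape)                              # (n_frames, Tw // 2 + 1)
```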

Characteristics of the combined sound from multiple sources

The air pressure waves of sounds from multiple sources, which build up the recorded sound wave, intertwine as a sum of the individual waves. The recorded audio signal is thus

x(t) = s_1(t) + s_2(t) + s_3(t) + ...,   (2.3)

where s_i is the wave component from one sound source. In terms of the frequency representation of the sound combination and the individual sources,

X(k) = S_1(k) + S_2(k) + S_3(k) + ...,   (2.4)

the complex-valued sound components S_i(k) from different sources may either reinforce or suppress each other at a certain frequency bin k of the combination. This depends on

the phases of the individual components S_i(k) at this frequency.

From an audio application point of view, this enables sound cancellation by generating an audio wave in the opposite phase to the original sound. On the other hand, it also makes the audio signal processing task of sound source separation very challenging.
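The following sketch (with synthetic sinusoids as stand-ins for the sources) illustrates both the linearity of (2.3)-(2.4) and the phase cancellation discussed above:

```python
import numpy as np

fs = 8000
t = np.arange(fs) / fs
s1 = np.sin(2 * np.pi * 440 * t)            # source 1: 440 Hz tone
s2 = np.sin(2 * np.pi * 440 * t + np.pi)    # source 2: same tone in opposite phase
s3 = np.sin(2 * np.pi * 880 * t)            # source 3: 880 Hz tone

x = s1 + s2 + s3                            # Eq. (2.3): the waves add sample by sample

# Eq. (2.4): the DFT of the mixture is the sum of the source DFTs.
print(np.allclose(np.fft.rfft(x),
                  np.fft.rfft(s1) + np.fft.rfft(s2) + np.fft.rfft(s3)))   # True

# s1 and s2 cancel each other, so only the 880 Hz component remains in the mixture.
print(np.allclose(x, s3))                   # True
```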

Human hearing based audio analysis

Characteristics of human hearing are often incorporated into audio signal analysis. This has been shown to be advantageous for many audio signal processing tasks, the most prominent examples being audio coding and automatic speech recognition. For efficient audio coding, all the details of the sound wave that are not perceived by a human may be discarded without causing any noticeable degradation of the sound. For audio interpretation tasks, including automatic speech recognition, taking into account the uneven sensitivity of the human ear to different frequencies (Zwicker (1961)) has been found crucial for the algorithms (O’Shaughnessy (2000)). The cochlea of the human ear reacts to sound in terms of the frequency components that the sound contains. The normal frequency range of human hearing is 20 Hz - 20 kHz, and the ability of the ear to distinguish different frequencies within this range is not uniform. The frequency resolution of the ear has been found to be finer for low audio frequencies than for high audio frequencies. That is, the human ear tends to integrate sound frequencies that are very close to each other. This phenomenon is called frequency masking. For low-frequency sounds, frequency masking integrates only frequencies very close to each other, whereas for high-frequency components it integrates the perception over a broad frequency range. The human ear also performs time-domain masking, and the masking effect differs depending on the overall sound content (Zwicker and Fastl (1999)), but these aspects are generally not considered in audio information retrieval tasks.

Reflecting the nonlinear frequency resolution of human hearing, the Mel frequency scale

MEL = 2595 \log_{10} \left( 1 + \frac{f}{700} \right)   (2.5)

was introduced in Stevens et al. (1937). To approximate the frequency masking of the ear in audio signal analysis, a fixed set of band-pass filters based on the Mel frequency scale is generally utilized. Traditionally, the Mel filter bank analysis is implemented on the magnitude spectrum of the discrete Fourier transform (DFT) of the signal using triangular basis functions, i.e. filters, which perform the spectral integration according to the Mel scale.
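As a small sketch, (2.5) can be evaluated directly to see how the Mel values grow with frequency (the listed frequencies are arbitrary examples):

```python
import numpy as np

f = np.array([100.0, 500.0, 1000.0, 4000.0, 8000.0])   # frequencies in Hz
mel = 2595.0 * np.log10(1.0 + f / 700.0)                # Eq. (2.5)
print(np.round(mel))   # Mel values grow roughly logarithmically with frequency
```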

The spectral energy E(b) of frequency band b is obtained using a triangular band filter f_b as

E(b) = \sum_{k=0}^{K-1} |X(k)| \cdot f_b(k),   (2.6)

where X(k) is the DFT coefficient given by (2.1). Such DFT-based Mel-scale magnitude spectrograms have been used as audio features for the automatic speech recognition task in Publications II, I, III, and V.
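A sketch of a triangular Mel filter bank and the band energies of (2.6) is given below (assuming Python with NumPy; the number of bands, frame length and sampling rate are illustrative choices, and the exact placement of the filter edges varies between implementations):

```python
import numpy as np

def mel_filterbank(B, n_fft, fs):
    """B triangular filters f_b over the K = n_fft // 2 + 1 DFT magnitude bins."""
    K = n_fft // 2 + 1
    mel_max = 2595.0 * np.log10(1.0 + (fs / 2.0) / 700.0)       # Eq. (2.5)
    mel_edges = np.linspace(0.0, mel_max, B + 2)                # equal spacing in Mel
    edges_hz = 700.0 * (10.0 ** (mel_edges / 2595.0) - 1.0)     # inverse of (2.5)
    bins = np.floor((n_fft + 1) * edges_hz / fs).astype(int)    # DFT bin indices
    fb = np.zeros((B, K))
    for b in range(B):
        lo, mid, hi = bins[b], bins[b + 1], bins[b + 2]
        fb[b, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)   # rising slope
        fb[b, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)   # falling slope
    return fb

fs, n_fft, B = 16000, 400, 40
fb = mel_filterbank(B, n_fft, fs)
X = np.fft.rfft(np.hanning(n_fft) * np.random.randn(n_fft))    # one windowed frame
E = fb @ np.abs(X)        # Eq. (2.6): E(b) = sum_k |X(k)| * f_b(k)
print(E.shape)            # (40,)
```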

Mel filter bank energy based features that are extensively utilized in audio information retrieval tasks are the Mel-frequency cepstral coefficients (MFCC), which were found to be highly useful already in Davis and Mermelstein (1980). To compute the MFCC features, the logarithm of the above presented Mel-scale energy spectrum E(b), b = 1, 2, ..., B, is further transformed using the discrete cosine transform (DCT). The DCT provides

decorrelation of the features, uncorrelatedness being a desirable property of a feature for many algorithms. Often, to incorporate information about the context of an audio frame, the amount of change of each MFCC coefficient over a few consecutive audio frames is used as an additional feature, called ∆MFCC (delta MFCC). I have used MFCC and ∆MFCC features in the experiments of Publications IV, V, and VI.
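A minimal sketch of the MFCC and ∆MFCC computation (assuming Python with NumPy and SciPy; the band energies are a random stand-in, the number of kept coefficients is an illustrative choice, and the simple first difference stands in for the regression-based delta computation often used in practice):

```python
import numpy as np
from scipy.fft import dct

# Stand-in for frame-wise Mel band energies E(b): (n_frames, B) matrix.
n_frames, B = 100, 40
E = np.abs(np.random.randn(n_frames, B)) + 1e-8     # small offset avoids log(0)

# MFCC: DCT of the log Mel energies; keep the first few (here 13) coefficients.
mfcc = dct(np.log(E), type=2, axis=1, norm='ortho')[:, :13]

# Delta-MFCC: frame-to-frame change of each coefficient, as a simple context feature.
delta = np.diff(mfcc, axis=0, prepend=mfcc[:1])

features = np.hstack([mfcc, delta])
print(features.shape)      # (100, 26)
```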