The post-processing algorithm located the formants of a voiced speech frame by picking the peaks of the LP spectrum. Its performance was investigated by analyzing its effects on different voiced sounds and by comparing the filter with other post-filters.
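The formant-location step can be illustrated with a minimal sketch. It assumes the autocorrelation method with the Levinson-Durbin recursion for the LP analysis and a simple local-maximum search on a uniform frequency grid. The function names, the default LP order of 10 and the 512-point grid are illustrative choices, not values taken from this work.

```python
import cmath
import math


def autocorrelation(frame, max_lag):
    """Unnormalised autocorrelation r[0..max_lag] of one (non-silent) frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i - lag] for i in range(lag, n))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Return [1, a1, ..., aM] of A(z) = 1 + a1 z^-1 + ... + aM z^-M."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k                  # updated prediction error
    return a


def lp_spectrum_db(a, n_points=512):
    """LP spectrum 20*log10(1/|A(e^jw)|) on n_points frequencies in [0, pi)."""
    spectrum = []
    for i in range(n_points):
        w = math.pi * i / n_points
        value = sum(a[k] * cmath.exp(-1j * w * k) for k in range(len(a)))
        spectrum.append(-20.0 * math.log10(abs(value)))
    return spectrum


def pick_peak_frequencies(spectrum_db, fs):
    """Frequencies (Hz) of the local maxima of the sampled LP spectrum."""
    n = len(spectrum_db)
    return [i * fs / (2.0 * n) for i in range(1, n - 1)
            if spectrum_db[i - 1] < spectrum_db[i] >= spectrum_db[i + 1]]


def estimate_formants(frame, fs, order=10):
    """Formant candidates of one voiced frame: peaks of its LP spectrum."""
    a = levinson_durbin(autocorrelation(frame, order), order)
    return pick_peak_frequencies(lp_spectrum_db(a), fs)
```

Peak picking on a sampled spectrum limits the frequency resolution to fs / (2 * n_points); a finer grid or interpolation around each maximum would be needed for more precise estimates.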
Introduction
More specifically, the speech signal is processed after it has reached the receiver's mobile terminal, just before it is played to the listener. This is called post-processing because it occurs at the end of the communication chain, after the speech has been sent over the communication channel and decoded.
Background
- Linear prediction
- Post-processing of speech
- Performance measures
- Focus of this work
The performance of the post-filter was compared with that of the conventional post-filter in [9] in two types of subjective listening tests. In Equation (2.9), the polynomials P_M(z) and Q_M(z) represent the LSP polynomials derived from the Mth-order LP polynomial A_M(z).
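As an illustration of how the LSP polynomials follow from the LP polynomial, the sketch below assumes the common definition P(z) = A(z) + z^-(M+1) A(z^-1) and Q(z) = A(z) - z^-(M+1) A(z^-1). Equation (2.9) itself is not reproduced here, so the sign convention and the function name are assumptions.

```python
def lsp_polynomials(a):
    """Build the LSP polynomials from a = [1, a1, ..., aM].

    Returns (p, q), each with M + 2 coefficients in powers of z^-1.
    Assumed convention: p is the symmetric sum, q the antisymmetric difference.
    """
    extended = list(a) + [0.0]          # A(z) padded to degree M + 1
    mirrored = extended[::-1]           # coefficients of z^-(M+1) * A(1/z)
    p = [x + y for x, y in zip(extended, mirrored)]
    q = [x - y for x, y in zip(extended, mirrored)]
    return p, q
```

Under this convention A(z) is recovered as (P(z) + Q(z)) / 2, which is a quick sanity check for any implementation.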
Implementation
The general setting
As mentioned earlier, in this scenario the speech signal reaching the post-processing block differs from the original speech spoken at the transmitting end. To simulate this, the speech samples must first be processed. The lower frequencies of the speech signal, from 50 to 300 Hz, are suppressed with a high-pass filter with a cut-off frequency of around 200 Hz.
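The high-pass step can be sketched as follows. The type and order of the filter actually used are not stated in this excerpt, so a first-order recursive high-pass (a discretised RC filter) is assumed purely for illustration; the function name and the 8 kHz sampling rate in the usage note are also assumptions.

```python
import math


def high_pass(samples, fs, cutoff_hz=200.0):
    """First-order high-pass: y[n] = alpha * (y[n-1] + x[n] - x[n-1])."""
    alpha = 1.0 / (1.0 + 2.0 * math.pi * cutoff_hz / fs)
    out = []
    prev_x = 0.0
    prev_y = 0.0
    for x in samples:
        y = alpha * (prev_y + x - prev_x)
        out.append(y)
        prev_x, prev_y = x, y
    return out
```

For example, `high_pass(frame, 8000)` removes the DC component completely and attenuates a 50 Hz tone much more than a 2 kHz tone. A steeper roll-off would need a higher-order design.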
After filtering, the active power level of the speech sample is set to -26 dBov using the method defined in ITU-T Recommendation P.56 [31]. When the speech signal has passed the post-processing block, it is at node B in Figure 3.2. However, here the speech is processed as samples that are part of a larger entity rather than as a continuous stream of data.
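The level adjustment can be approximated with the sketch below. Note that this is a simplification: it uses the plain RMS level over all samples, not the active speech level of P.56, which excludes pauses. It also assumes that samples are normalised so that the overload point is 1.0; the function names are illustrative.

```python
import math


def level_dbov(samples, full_scale=1.0):
    """RMS level relative to the overload point, in dBov (simplified)."""
    rms = math.sqrt(sum(v * v for v in samples) / len(samples))
    return 20.0 * math.log10(rms / full_scale)


def normalize_level(samples, target_dbov=-26.0, full_scale=1.0):
    """Scale the signal so that its RMS level equals target_dbov."""
    gain_db = target_dbov - level_dbov(samples, full_scale)
    gain = 10.0 ** (gain_db / 20.0)
    return [gain * v for v in samples]
```

Because pauses lower the plain RMS level, a sample normalised this way will generally end up louder during active speech than one normalised with the P.56 method.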
The post-processing algorithm
This means that part of the current frame is filtered to avoid sudden transitions between frames. This shifts some of the energy from the lower frequency bands to the higher frequencies. In Equation (3.7), the parameter µ is the first-order linear prediction coefficient of the post-filter Hpf(z).
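One way to obtain such a first-order prediction coefficient is sketched below. It assumes that µ is computed as r(1)/r(0) from a truncated impulse response of Hpf(z) and that the compensating filter has the first-order FIR form 1 - µ z^-1. Equation (3.7) is not reproduced in this excerpt, so both the form and the function names are assumptions.

```python
def first_order_lp_coefficient(impulse_response):
    """First-order LP coefficient mu = r(1) / r(0) of a sequence."""
    h = impulse_response
    r0 = sum(v * v for v in h)
    r1 = sum(h[n] * h[n - 1] for n in range(1, len(h)))
    return r1 / r0 if r0 > 0.0 else 0.0


def apply_first_order_fir(samples, mu):
    """Filter with 1 - mu * z^-1, i.e. y[n] = x[n] - mu * x[n-1]."""
    out = []
    prev = 0.0
    for x in samples:
        out.append(x - mu * prev)
        prev = x
    return out
```

A positive µ indicates a low-pass tilt in Hpf(z), so the FIR filter above emphasises the higher frequencies and flattens the overall response.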
The amplitude response of the entire post-filter structure, with Hpf(z) and Htilt(z) in cascade, is derived in Appendix A. To avoid artifacts caused by sudden transitions between consecutive voiced frames, the coefficients of the post-filter are interpolated between the current frame and the next frame. First, a neutral formant post-filter is constructed for the current frame, and the filter's delay line is initialized with zeros.
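The interpolation of filter coefficients between frames can be sketched as below. Linear interpolation directly in the coefficient domain and the number of interpolation steps are assumptions made for illustration; the domain in which the interpolation is actually carried out is not stated in this excerpt.

```python
def interpolate_coefficients(current, following, n_steps):
    """Return n_steps coefficient sets moving from current toward following.

    Step 0 equals the current frame's coefficients; the following frame's
    own coefficients are reached only when that frame starts.
    """
    if len(current) != len(following):
        raise ValueError("coefficient vectors must have the same length")
    sets = []
    for step in range(n_steps):
        w = step / n_steps                      # interpolation weight in [0, 1)
        sets.append([(1.0 - w) * c + w * f
                     for c, f in zip(current, following)])
    return sets
```

For recursive filters, interpolating raw coefficients does not in general guarantee a stable intermediate filter, so a practical implementation may prefer to interpolate in another representation, such as line spectral frequencies.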
SII calculation
Subjective tests
Methods
Given the parameter spacing, listeners would have M different samples to listen to and would have to choose the best one. Grid spacing in this case would greatly affect the results. Automotive noise was added to the test samples, resulting in a sentence SNR of -5 dB.
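Mixing noise into a sentence at a given SNR can be sketched as follows. The sketch assumes that the sentence SNR is the ratio of the mean powers of the speech and the noise over the whole sentence; the function names are illustrative, and the exact measurement used in the tests is not specified in this excerpt.

```python
import math


def mean_power(samples):
    """Mean squared value of a signal."""
    return sum(v * v for v in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db=-5.0):
    """Scale the noise to the requested sentence SNR and add it to the speech.

    Returns (noisy_speech, scaled_noise). The noise must be at least as long
    as the speech; extra noise samples are discarded.
    """
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as the speech")
    noise = noise[:len(speech)]
    target_ratio = 10.0 ** (snr_db / 10.0)      # speech power / noise power
    gain = math.sqrt(mean_power(speech) / (mean_power(noise) * target_ratio))
    scaled = [gain * v for v in noise]
    return [s + n for s, n in zip(speech, scaled)], scaled
```

At -5 dB the scaled noise carries roughly three times the power of the speech, which is why intelligibility rather than quality dominates the listening task.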
All the listeners were naïve by their own assessment, and 14 of them had a technical background. The idea of the demonstration was to familiarize the subjects with the graphical user interface and also to select an appropriate volume level for the test. For each of the six samples, all the locations of the subject's clicks on the screen were saved.
Results and discussion
One purpose of all these measures was to make sure that the subjects understood the idea of the test correctly, and also to eliminate possible accidental erroneous choices. There was also a distinct possibility that the inherent logic of the two-dimensional parameter space would not be understood. Most of the listeners characterized the lower right corner as the clearest and said that the speech there was easier to separate from the noise.
On the other hand, some listeners said that the naturalness of the voice was compromised and that the processing on the x-axis made the speaker sound urgent and nervous. Most of the listeners felt that different samples were affected in a very similar way by the arrangement. It was also briefly tested whether the average values would correlate with the fundamental frequencies of the speakers.
Speech Intelligibility Index
However, when the value on the x-axis goes below 0.7, the contour curves start to exclude some of the higher values on the y-axis. When the first formant is attenuated, the energy is shifted and some of it lands in the frequency band of the second formant. However, all listeners understood the samples completely with at least some of the parameter values.
When looking at the final choices of the listeners, marked with black squares, it is difficult to see a clear correlation with the contours of the speech intelligibility index. On the other hand, the differences between the SII values in the figures are small compared to the entire range of the index. These results can be used to investigate the effects of the band importance function and to determine the most appropriate one.
Objective evaluation
Post-filter gains
In each figure, the average value of the parameter for that particular speaker obtained from the subjective tests is marked with a black square. The gains at 0 Hz, shown in Figures 5.1(a) and 5.1(b), are mostly quite small, about -3 dB for the average of the samples as well as for the selected parameter values. The average attenuations for the first estimated formants, shown in Figures 5.2(a) and 5.2(b), reach much larger values, as expected.
Gains for the mean values from the subjective tests and for the selected parameter values are around -11 dB. However, the average values from the subjective tests have gains of only around 4 to 6 dB. Gains for mean values and for selected parameter values are around 1.5 dB.
Typical behavior
In the upper part of the figures, the average gains reach 20 dB, which means that the formant peak must be extremely sharp. Gains for mean values and for selected parameter values are around 1.5 dB.
Figure caption: (a) Spectra of the original and processed signal. The frequencies marked with ∼ are the estimated frequencies of the first and second formants.
In other words, the amplitude of the original signals is higher than that of the processed signals in this region. Because the ofr2 value is only 0.93, the second formant is not significantly sharper than the rest of the peaks. In other words, some of the energy from the first formant frequency has been shifted to an even lower frequency, and therefore the processed signal has a higher amplitude than the original signal in that frequency region.
Comparison with other post-filters
The standard post-filter has small gains throughout the spectrum and does not change the speech signal very much. However, a noticeable difference is that the AMR post-filter tries to attenuate the valleys between formants, whereas the developed post-filter actually amplifies some of them. Another post-filter is also used for comparison, namely the differentiation filter of Hall et al.
The proposed post-filter has a much flatter frequency response in the higher frequency band, which helps the speech sound more natural. The differentiation filter also amplifies the valleys between some formants, but the effect is stronger than with the developed post-filter. A major issue that shows clearly in the previous figures is that the proposed post-filter shifts the first formant to a lower frequency band, as mentioned earlier.
Conclusion
The contribution of this work
We also compared the post-filter with some realizations of post-filters previously used by other authors. Based on these evaluations, it was found that the post-filter mostly works as desired. It is hard to judge how good a post-filter is in the absence of formal listening test results.
However, the proposed post-filter succeeds in bringing a new, more adaptive method for post-processing in high noise levels. With the developed post-filter, it is possible to fine-tune the processing, so that the quality of the speech does not suffer.
Practical implementation
It also presented some unexpected and unwanted behaviors that need further study to determine their causes and the resulting audible effects on processed speech. Previously, this problem was approached with simple, static filter structures that improve intelligibility at the expense of quality. For the pre-emphasis, a first-order LP analysis is needed, and the pre-emphasis is done with a first-order FIR filter.
In an ideal situation, the decoder block would be able to pass the linear prediction coefficients of the speech frame to the postfilter. After the formant filter has been formed, a first-order LP is needed to determine the slope. In addition to these, there are of course several steps, such as peak picking from the LP spectrum, which require some calculations.
Further research
It could be taken into account in the post-filtering and thus perhaps improve the results further. The next logical step would be to perform a formal subjective test to obtain a real measure of post-filter performance. It is doubtful whether the performance of the post-filter would be affected much by the change of language from Finnish to English, but it would be interesting to see how the speech intelligibility index values discussed in Chapter 4 would be affected by the switch.
In other words, when the SNR of the resulting noisy speech signal is very low, the post-processing can be more aggressive, and its effect would then gradually decrease as the SNR increases. In this way, the quality of the speech would not be affected when the conditions are good. The type or the characteristics of the noise could also have an impact on the post-processing.
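The idea of SNR-dependent processing could look like the following sketch. The linear mapping, the two SNR thresholds and the function name are entirely hypothetical and serve only to illustrate the principle; no such mapping was implemented or evaluated in this work.

```python
def processing_strength(snr_db, low_snr_db=-5.0, high_snr_db=15.0):
    """Map an SNR estimate to a post-processing strength in [0, 1].

    1.0 means full-strength processing (poor conditions), 0.0 means that
    the post-filter is effectively bypassed (good conditions).
    The threshold values are hypothetical placeholders.
    """
    if snr_db <= low_snr_db:
        return 1.0
    if snr_db >= high_snr_db:
        return 0.0
    return (high_snr_db - snr_db) / (high_snr_db - low_snr_db)
```

The returned value could, for instance, scale the post-filter parameters between their neutral and their most extreme settings, so that the processing fades out smoothly as the conditions improve.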
Bibliography
Yeldener, “An adaptive post-filtering technique based on the modified Yule-Walker filter,” in ICASSP '99: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999.
[14] ITU-T, “Recommendation G.729: Coding of Speech at 8 kbit/s Using Conjugate-Structure Algebraic Code-Excited Linear Prediction (CS-ACELP),” March 1996.
[21] ITU-T, “Recommendation P.862: Perceptual Evaluation of Speech Quality (PESQ): An Objective Method for End-to-End Speech Quality Assessment of Narrowband Telephone Networks and Speech Codecs,” February 2001.
Fairbanks, “The phonemic differentiation test: The rhyme test,” The Journal of the Acoustical Society of America, vol.
Hazan, “The SUS Test: A Method for Assessing the Comprehension of Text-to-Speech Synthesis Using Semantically Unpredictable Sentences,”
Versfeld, “A speech intelligibility index-based approach to predict speech acceptance threshold for sentences in fluctuating noise for normal-hearing listeners,” The Journal of the Acoustical Society of America, vol.
Derivation of the filter amplitude response
Test sentences
Test instructions