
Adaptive post-filtering of speech in mobile communications


Academic year: 2023




The post-processing worked by locating the formants of a voiced speech frame by extracting the peaks of the LP spectrum. The performance of the post-processing algorithm was investigated by analyzing its effects on different voiced sounds and by comparing the filter with other post-filters.
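The formant-location step described above can be illustrated with a minimal sketch. It assumes the autocorrelation method with the Levinson-Durbin recursion and a simple local-maximum search on the LP spectrum; the function names, the LP order, and the grid size are illustrative assumptions, not the thesis's actual implementation.

```python
import cmath
import math


def autocorrelation(frame, max_lag):
    """Autocorrelation r[0..max_lag] of one speech frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LP normal equations; returns [1, a1, ..., aM] of A(z)."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] + sum(a[k] * r[m - k] for k in range(1, m))
        k_m = -acc / err                    # reflection coefficient
        prev = a[:]
        for k in range(1, m):
            a[k] = prev[k] + k_m * prev[m - k]
        a[m] = k_m
        err *= 1.0 - k_m * k_m              # prediction error update
    return a


def lp_spectrum_db(a, n_points, fs):
    """Magnitude of 1/A(e^{jw}) in dB on a grid over [0, fs/2)."""
    freqs, mags = [], []
    for i in range(n_points):
        f = i * (fs / 2) / n_points
        w = 2 * math.pi * f / fs
        a_w = sum(c * cmath.exp(-1j * w * k) for k, c in enumerate(a))
        freqs.append(f)
        mags.append(-20 * math.log10(abs(a_w)))
    return freqs, mags


def pick_peaks(freqs, mags):
    """Frequencies of the interior local maxima of the spectrum."""
    return [freqs[i] for i in range(1, len(mags) - 1)
            if mags[i - 1] < mags[i] >= mags[i + 1]]


def estimate_formants(frame, fs, order=10, n_points=256):
    """Formant candidates (Hz) as peaks of the LP spectrum (assumed defaults)."""
    r = autocorrelation(frame, order)
    a = levinson_durbin(r, order)
    return pick_peaks(*lp_spectrum_db(a, n_points, fs))
```

Calling `estimate_formants(frame, 8000)` on a voiced frame returns the peak frequencies in ascending order; the first two would correspond to the first and second formant estimates.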


More specifically, the processing of the speech signal is carried out when it has already reached the receiver's mobile terminal and just before it is played to the listener's ear. This is called post-processing because it occurs at the end of the communication chain after the speech has been sent and decoded over the communication channel.


  • Linear prediction
  • Post-processing of speech
  • Performance measures
  • Focus of this work

The performance of the post-filter compared to the conventional post-filter in [9] was studied with two types of subjective listening tests. In Equation (2.9), the polynomials P_M(z) and Q_M(z) represent the LSP polynomials derived from the Mth-order LP polynomial, A_M(z).
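Equation (2.9) itself is not reproduced in this extract. As a sketch under that caveat, the following uses the standard textbook definition of the LSP polynomials, P(z) = A(z) + z^-(M+1) A(z^-1) and Q(z) = A(z) - z^-(M+1) A(z^-1); the function name is illustrative.

```python
def lsp_polynomials(a):
    """Build the LSP polynomials from a = [1, a1, ..., aM].

    Returns (p, q) as coefficient lists of length M + 2, where index k
    holds the coefficient of z^-k. Standard definition, assumed here.
    """
    ext = list(a) + [0.0]           # A(z) padded to degree M + 1
    rev = [0.0] + list(a)[::-1]     # z^-(M+1) * A(z^-1)
    p = [x + y for x, y in zip(ext, rev)]   # symmetric polynomial
    q = [x - y for x, y in zip(ext, rev)]   # antisymmetric polynomial
    return p, q
```

With this definition the LP polynomial is recovered as A(z) = (P(z) + Q(z)) / 2, which is why the pair can represent the LP coefficients without loss.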


The general setting

As mentioned earlier, in this scenario the speech signal reaching the post-processing block is different from the original spoken at the transmitting end. To simulate this, the speech samples must go through some processing. The lower frequencies of 50 to 300 Hz of the speech signal are suppressed by using a high-pass filter with a cut-off frequency around 200 Hz.
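The text gives only the approximate cut-off frequency, not the filter type. As an assumption for illustration, the sketch below uses a second-order Butterworth-style high-pass biquad (coefficients from the widely used audio-EQ cookbook formulas) applied sample by sample.

```python
import math


def highpass_biquad(fc, fs, q=1 / math.sqrt(2)):
    """Second-order high-pass coefficients (b, a), normalized so a[0] = 1."""
    w0 = 2 * math.pi * fc / fs
    alpha = math.sin(w0) / (2 * q)
    cosw = math.cos(w0)
    a0 = 1 + alpha
    b = [(1 + cosw) / 2 / a0, -(1 + cosw) / a0, (1 + cosw) / 2 / a0]
    a = [1.0, -2 * cosw / a0, (1 - alpha) / a0]
    return b, a


def filter_signal(b, a, x):
    """Direct-form I filtering of the sequence x with a zero initial state."""
    y = []
    x1 = x2 = y1 = y2 = 0.0
    for xn in x:
        yn = b[0] * xn + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y.append(yn)
    return y
```

For example, `filter_signal(*highpass_biquad(200.0, 8000), samples)` strongly attenuates components near 50 Hz while leaving the band above a few hundred hertz essentially unchanged.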

After filtering, the active power level of the speech sample is set to -26 dBov using the method defined in ITU-T standard P.56 [31]. When the speech signal has passed the post-processing block, it is located in node B in Figure 3.2. However, here the speech is processed into samples that are part of a larger entity rather than a continuous stream of data.
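The level adjustment can be sketched as follows. Note the simplification: ITU-T P.56 measures the active speech level, which excludes pauses, whereas this sketch uses the plain RMS level of the whole signal. It also assumes samples normalized to the range [-1, 1], so that 0 dBov corresponds to a mean power of 1.

```python
import math


def rms_level_dbov(x):
    """Plain RMS level in dBov for samples normalized to [-1, 1].

    Simplification: this is NOT the P.56 active speech level.
    """
    power = sum(v * v for v in x) / len(x)
    if power <= 0.0:
        raise ValueError("signal is silent; level is undefined")
    return 10 * math.log10(power)


def normalize_to_dbov(x, target_dbov=-26.0):
    """Scale x so that its plain RMS level equals target_dbov."""
    gain = 10 ** ((target_dbov - rms_level_dbov(x)) / 20)
    return [gain * v for v in x]
```

For speech with long pauses, the P.56 active level is higher than the plain RMS level, so a real implementation would apply a somewhat smaller gain than this sketch.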

The post-processing algorithm

This means that part of the current frame is filtered to avoid sudden transitions between frames. This shifts some of the energy from the lower frequency bands to the higher frequencies. In equation (3.7), the parameter µ is the coefficient of the first-order linear prediction of the post-filter Hpf(z).

The amplitude response of the entire post-filter structure with Hpf(z) and Htilt(z) in cascade is derived in Appendix A. To avoid artifacts caused by sudden transitions between consecutive tuned frames, the coefficients of the post-filter are interpolated between the current frame and the next frame. First, a neutral formant post-filter is constructed for the current frame, and the filter's delay line is initialized with zeros.
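The interpolation idea can be sketched as a linear cross-fade of coefficient sets over a number of sub-steps within the frame. The domain in which the thesis interpolates and the number of steps are not given in this extract, so both are assumptions here.

```python
def interpolate_coefficients(current, nxt, n_steps):
    """Return n_steps coefficient sets moving linearly from current toward nxt.

    Step 0 equals `current`; the final step stops just short of `nxt`,
    which becomes step 0 of the following frame.
    """
    if len(current) != len(nxt):
        raise ValueError("coefficient sets must have equal length")
    sets = []
    for s in range(n_steps):
        w = s / n_steps                      # interpolation weight in [0, 1)
        sets.append([(1 - w) * c + w * n for c, n in zip(current, nxt)])
    return sets
```

One caveat: interpolating direct-form coefficients of a recursive filter does not by itself guarantee a stable intermediate filter, which is why speech codecs commonly interpolate in the LSP domain instead.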

SII calculation

Subjective tests


Given the parameter spacing, listeners would have M different samples to listen to and would have to choose the best one. Grid spacing in this case would greatly affect the results. Car noise was added to the test samples, resulting in a sentence SNR of -5 dB.
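Adding noise at a target SNR amounts to scaling the noise so that the speech-to-noise power ratio matches the target. The sketch below assumes the SNR is computed from the mean power over the whole sentence; the exact level measure used in the thesis is not stated in this extract.

```python
import math


def mean_power(x):
    """Mean squared value of the sequence x."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(speech, noise, snr_db):
    """Scale noise to the target SNR and add it to the speech.

    Returns (noisy, scaled_noise). The noise must be at least as long
    as the speech; any excess is discarded.
    """
    if len(noise) < len(speech):
        raise ValueError("noise is shorter than speech")
    noise = noise[:len(speech)]
    gain = math.sqrt(mean_power(speech) / (mean_power(noise) * 10 ** (snr_db / 10)))
    scaled = [gain * n for n in noise]
    return [s + n for s, n in zip(speech, scaled)], scaled
```

At -5 dB the noise power is roughly 3.2 times the speech power, which illustrates how demanding the test condition was.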

All the listeners were naïve by their own assessment, and 14 of them had a technical background. The idea of the demonstration was to familiarize the subjects with the graphical user interface and also to select an appropriate volume level for the test. For each of the six samples, all the locations of the subject's clicks on the screen were saved.

Results and discussion

One purpose of all these measures was to make sure that the subjects understood the idea of the test correctly, and also to eliminate possible accidental erroneous choices. There was also a distinct possibility that the inherent logic of the two-dimensional parameter space would not be understood. Most of the listeners characterized the lower right corner as the clearest and said that the speech there was easier to separate from the noise.

On the other hand, some listeners said that the naturalness of the voice was compromised and that the processing on the x-axis made the speaker sound urgent and nervous. Most of the listeners felt that different samples were affected in a very similar way by the arrangement. It was also briefly tested whether the average values would correlate with the fundamental frequencies of the speakers.

Speech Intelligibility Index

However, when the value on the x-axis goes below 0.7, the contour curves start to exclude some of the higher values on the y-axis. When the first formant is attenuated, the energy is shifted and some of it lands in the frequency band of the second formant. However, they all understood the samples completely at least with some of the parameter values.

When looking at the final choices of the listeners marked with black squares, it is difficult to see a clear correlation with the contours of the speech intelligibility index. On the other hand, the differences between the SII values in the figures are small compared to the entire range of the index. These results can be used to investigate the effects of the band importance function and to determine the most appropriate one.

Objective evaluation

Post-filter gains

In each figure, the average value of the parameter for that particular speaker obtained from the subjective tests is marked with a black square. The gains at 0 Hz, shown in Figures 5.1(a) and 5.1(b), are mostly quite small, about -3 dB for the average of the samples as well as for the selected post-filter. The average attenuations for the first estimated formants, shown in Figures 5.2(a) and 5.2(b), reach much larger values, as expected.

The gains for the mean values from the subjective tests and for the selected post-filter values are around -11 dB. However, the average values from the subjective tests have gains of only around 4 to 6 dB. The gains for the mean values and for the selected parameter values are around 1.5 dB.

Typical behavior

In the upper part of the figures, the average gains reach 20 dB, which means that the formant peak must be extremely sharp.

Figure caption: (a) Spectra of the original and processed signal. The frequencies marked with ∼ are the estimated frequencies of the first and second formants.

In other words, the amplitude of the original signals is higher than that of the processed signals in this region. Because the ofr2 value is only 0.93, the second formant is not significantly sharper than the rest of the peaks. In other words, some of the energy from the first formant frequency has been shifted to an even lower frequency, and therefore the processed signal has a higher amplitude than the original signal in that frequency region.

Comparison with other post-filters

The standard post-filter has small gains throughout the spectrum and does not change the speech signal very much. However, a noticeable difference is that the AMR post-filter tries to attenuate the valleys between formants, whereas the developed post-filter actually boosts some of them. Another post-filter is also used for comparison, namely the differentiation filter of Hall et al.

The proposed post-filter has a much flatter frequency response in the higher frequency band, which helps the speech sound more natural. The differentiation filter also raises the valleys between some formants, but the effect is stronger than with the developed post-filter. A major issue that clearly shows in the previous figures is that the proposed post-filter shifts the first formant to a lower frequency band, as mentioned earlier.


The contribution of this work

We also compared the postfilter with some realizations of postfilters previously used by other authors. Based on these evaluations, it was found that the post-filter mostly works as desired. It's hard to judge how good a postfilter is in the absence of official listening test results.

However, the proposed post-filter succeeds in bringing a new, more adaptive method for post-processing in high noise levels. With the developed post-filter, it is possible to fine-tune the processing, so that the quality of the speech does not suffer.

Practical implementation

It also presented some unexpected and unwanted behaviors that need further study to determine their causes and the resulting audible effects on processed speech. Previously, this problem was approached with simple, static filter structures that improve intelligibility at the expense of quality. For the pre-emphasis, a first-order LP analysis is needed, and the pre-emphasis is done with a first-order FIR filter.
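The pre-emphasis step can be sketched as follows, assuming the first-order LP coefficient is estimated as the ratio of the lag-1 to the lag-0 autocorrelation and then used as the single tap of the FIR filter. The function names are illustrative, and the exact form used in the thesis is not shown in this extract.

```python
def first_order_lp_coefficient(x):
    """First-order LP predictor coefficient mu = r[1] / r[0]."""
    r0 = sum(v * v for v in x)
    r1 = sum(x[n] * x[n - 1] for n in range(1, len(x)))
    return r1 / r0 if r0 > 0 else 0.0


def pre_emphasize(x, mu):
    """First-order FIR pre-emphasis: y[n] = x[n] - mu * x[n-1]."""
    return [x[n] - mu * (x[n - 1] if n > 0 else 0.0) for n in range(len(x))]
```

Because mu is re-estimated for each frame, the amount of pre-emphasis adapts to the spectral tilt of the frame: a strongly low-pass frame gives mu close to 1, while a spectrally flat frame gives mu close to 0.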

In an ideal situation, the decoder block would be able to pass the linear prediction coefficients of the speech frame to the post-filter. After the formant filter has been formed, a first-order LP is needed to determine the tilt. In addition to these, there are of course several steps, such as peak picking from the LP spectrum, which require some calculations.

Further research

It could be taken into account in the post-filtering and thus perhaps improve the results further. The next logical step would be to perform a formal subjective test to obtain a real measure of post-filter performance. It is doubtful whether the performance of the post-filter would be affected much by the change of language from Finnish to English, but it would be interesting to see how the speech intelligibility index values discussed in Chapter 4 would be affected by the switch.

In other words, when the SNR of the resulting noisy speech signal is very low, the post-processing can be more extreme, and the post-processing effects then gradually decrease as the SNR increases. In this way, the quality of the speech would not be affected when the conditions are good. Also, the type of noise or the characteristics of the noise can have an impact on post-processing.
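One way to picture such SNR-dependent control is a strength factor that is 1 at low SNR and fades linearly to 0 at high SNR. The mapping and both thresholds below are purely hypothetical, chosen only to illustrate the idea; the text proposes the behavior but gives no concrete rule.

```python
def processing_strength(snr_db, low_db=-5.0, high_db=15.0):
    """Hypothetical post-processing strength in [0, 1] as a function of SNR.

    Full strength at or below low_db, none at or above high_db, and a
    linear fade in between. Both thresholds are illustrative assumptions.
    """
    if snr_db <= low_db:
        return 1.0
    if snr_db >= high_db:
        return 0.0
    return (high_db - snr_db) / (high_db - low_db)
```

The returned factor could then scale the post-filter parameters, so that the filter approaches a neutral response in good conditions.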


Yeldener, “An adaptive post-filtering technique based on the modified Yule-Walker filter,” in ICASSP '99: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999.

[14] ITU-T, “Recommendation G.729: Coding of Speech at 8 kbit/s Using Conjugate-Structure Algebraic Code-Excited Linear Prediction (CS-ACELP),” March 1996.

[21] ITU-T, “Recommendation P.862: Perceptual Evaluation of Speech Quality (PESQ): An Objective Method for End-to-End Speech Quality Assessment of Narrowband Telephone Networks and Speech Codecs,” February 2001.

Fairbanks, “The phonemic differentiation test: The rhyme test,” The Journal of the Acoustical Society of America, vol. Hazan, “The SUS Test: A Method for Assessing the Comprehension of Text-to-Speech Synthesis Using Semantically Unpredictable Sentences,”. Versfeld, “A speech intelligibility index-based approach to predict speech acceptance threshold for sentences in fluctuating noise for normal-hearing listeners,” The Journal of the Acoustical Society of America , vol.

Derivation of the filter amplitude response

Test sentences

Test instructions


