Automatic evaluation of tracheoesophageal substitute voices
More about the book
In 20 to 40 percent of all cases of laryngeal cancer, a total laryngectomy, i.e. the removal of the entire larynx, has to be performed. For the patient, this means the loss of the natural voice and thus of the main means of communication. A popular method of voice restoration involves a shunt valve ("voice prosthesis") between the trachea and the pharyngoesophageal segment, which establishes the tracheoesophageal (TE) substitute voice. From time to time, the therapist has to evaluate the substitute voice in order to report therapy progress. This evaluation is subjective and therefore depends on the individual expert's experience and similar factors. This thesis examines how automatic methods can provide an objective means of evaluating substitute voices. Some established objective measures exist, but they are restricted to the evaluation of sustained vowels. This thesis takes the step from the automatic analysis of vowel recordings to the analysis of text recordings. Judging speech quality objectively in a real communication situation requires the analysis of entire words and sentences, because the intelligibility of a substitute voice in a dialogue is a substantial criterion for evaluation. Automatic word recognition methods were applied to a standard text that was read aloud by the test persons. Information on the intelligibility of the individual speakers was obtained by comparing word recognition rates with reference evaluations from human experts. A prosody module made it possible to extract not only acoustic information on the speaker's voice but also individual speaking characteristics. The inter-rater variability among humans was compared to the automatic analysis results; the main finding was that the correlation between human and automatic ratings was as good as the agreement within the group of human raters.
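The comparison of word recognition rates with human ratings can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names, the use of word accuracy based on Levenshtein alignment, and the choice of the Pearson coefficient are assumptions made for illustration, and any input data would have to come from a real recognizer and real expert scores.

```python
from math import sqrt


def edit_distance(ref, hyp):
    """Minimum number of substitutions, deletions and insertions
    needed to turn the word list `ref` into the word list `hyp`."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_accuracy(reference, hypothesis):
    """Word accuracy in percent: 100 * (N - S - D - I) / N.
    The value can become negative if the recognizer inserts many words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference text must not be empty")
    return 100.0 * (1.0 - edit_distance(ref, hyp) / len(ref))


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)
```

In such a setup, `word_accuracy` would be computed once per speaker from the reference text and the recognizer output, and `pearson` would then relate the per-speaker accuracies to the averaged expert scores. The same coefficient computed between individual raters gives a baseline for the agreement among humans.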
Automatic recognition of distant-talking recordings could be slightly improved by using mu-law features, which are modified Mel-frequency cepstral coefficients (MFCC). Artificially reverberated training data for the recognizer is another way to achieve better recognition rates, even when the reverberation in the test data does not match the acoustic properties of the training data. This is a step towards therapy sessions in which patients will no longer be required to wear a headset.
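Artificial reverberation of training data is commonly done by convolving a close-talking recording with a room impulse response. The following sketch shows this idea under stated assumptions: the impulse response is synthetic (exponentially decaying noise with a chosen reverberation time), and the function names and parameters are hypothetical. The text above does not say which impulse responses were used in the thesis.

```python
import math
import random


def synthetic_impulse_response(rt60_s, sample_rate, seed=0):
    """Assumed room model: white noise whose amplitude decays by 60 dB
    (a factor of 1000) within `rt60_s` seconds."""
    rng = random.Random(seed)
    n = max(1, round(rt60_s * sample_rate))
    decay = math.log(1000.0) / rt60_s
    ir = [rng.gauss(0.0, 1.0) * math.exp(-decay * t / sample_rate)
          for t in range(n)]
    ir[0] = 1.0  # direct sound arrives first and undamped
    return ir


def convolve(signal, ir):
    """Direct-form linear convolution; slow, but free of dependencies."""
    out = [0.0] * (len(signal) + len(ir) - 1)
    for i, s in enumerate(signal):
        if s == 0.0:
            continue
        for j, h in enumerate(ir):
            out[i + j] += s * h
    return out


def reverberate(signal, rt60_s, sample_rate, seed=0):
    """Return a peak-normalized, artificially reverberated copy of `signal`."""
    wet = convolve(signal, synthetic_impulse_response(rt60_s, sample_rate, seed))
    peak = max(abs(x) for x in wet) or 1.0
    return [x / peak for x in wet]
```

For real recordings, an FFT-based convolution and measured impulse responses would be the more practical choice. The direct form is used here only to keep the example self-contained.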