DATE:
2015-01-30
UNIVERSAL IDENTIFIER: http://hdl.handle.net/11093/241
UNESCO SUBJECT: 3325 Telecommunications Technology
DOCUMENT TYPE: doctoralThesis
ABSTRACT
Speech technologies have become of paramount importance in our everyday life thanks to technological improvements and years of research. The available resources and tools for speech processing have made it possible to process audio contents automatically, a task that was performed manually in the past but that the huge amount of multimedia information available nowadays makes prohibitive to carry out by hand. Moreover, interdisciplinary research has been boosted in fields that can take advantage of speech technologies, such as medicine and psychology; speech technologies have great potential for the study and treatment of different pathologies such as mental health disorders or speech-related impairments.
This Thesis contributes to the improvement of emotional state detection techniques; specifically, we focus on the emotion recognition and depression detection tasks. Extracting the speaker's emotional state involves several steps: first, the speech parts of an audio stream must be detected, in order to discard the non-speech parts that provide no information about the speaker's emotional state; after this, speaker segmentation must be performed, because the boundaries between different speakers must be found before extracting information about any of them; on some occasions, it may also be of interest to perform speaker diarization in order to know which parts of an audio stream are spoken by the same speaker; and finally, information about the emotional state can be extracted. This Thesis presents contributions at these four stages: audio segmentation, speaker segmentation, speaker diarization and emotional state detection; at this latter stage we focus on continuous emotion recognition and depression detection.
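As a rough illustration of how these four stages feed into each other, the sketch below wires them together in order. The stage functions, their signatures and their dummy outputs are hypothetical placeholders for illustration only, not the systems developed in the Thesis.

```python
# Minimal sketch of the processing chain described above; every stage
# body is a placeholder that only shows the data flow between stages.

def audio_segmentation(audio):
    """Split the stream into labelled segments (speech / music / noise...)."""
    return [("speech", 0.0, 12.4), ("music", 12.4, 20.0)]  # dummy output

def speaker_segmentation(speech_segments):
    """Detect speaker change-points inside the speech segments."""
    return [(0.0, 5.1), (5.1, 12.4)]  # dummy speaker turns

def speaker_diarization(speaker_turns):
    """Group turns uttered by the same speaker."""
    return {(0.0, 5.1): "spk1", (5.1, 12.4): "spk2"}  # dummy clustering

def emotional_state(turns_by_speaker):
    """Estimate an emotional descriptor (or depression score) per speaker."""
    return {spk: 0.0 for spk in set(turns_by_speaker.values())}  # dummy scores

def pipeline(audio):
    segments = audio_segmentation(audio)
    speech = [s for s in segments if s[0] == "speech"]
    turns = speaker_segmentation(speech)
    diarization = speaker_diarization(turns)
    return emotional_state(diarization)

print(pipeline(audio=None))
```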
Audio segmentation systems usually achieve acceptable results on simple tasks such as speech detection, but state-of-the-art systems still find it difficult to segment audio contents when music and noise are present. We propose a framework for the fusion of audio segmentation systems that aims to enhance their strengths and mitigate their weaknesses. This decision-level fusion strategy consists in estimating the reliability of the different audio segmentation systems when classifying audio into the different classes; this reliability is estimated by extracting class-conditional probabilities from a confusion matrix, which is obtained by analysing system performance on training data. The information extracted from these class-conditional probabilities is used to propose different reliability estimates.
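A minimal sketch of the kind of reliability estimate involved, assuming class-conditional probabilities of the form P(reference class | hypothesised class) derived from each system's confusion matrix and then used to weight the systems' votes. The confusion matrices, class set and fusion rule below are illustrative assumptions, not the exact estimates proposed in the Thesis.

```python
import numpy as np

# Confusion matrices on training data for two hypothetical segmentation
# systems: rows = reference class, columns = hypothesised class.
# All numbers are made up for illustration.
CLASSES = ["speech", "music", "noise"]
conf_A = np.array([[90, 5, 5],
                   [10, 80, 10],
                   [15, 10, 75]], dtype=float)
conf_B = np.array([[85, 10, 5],
                   [5, 90, 5],
                   [20, 15, 65]], dtype=float)

def class_conditional(conf):
    """P(reference class | hypothesised class), one column per hypothesis."""
    return conf / conf.sum(axis=0, keepdims=True)

def fuse(decisions, conf_matrices):
    """Decision-level fusion: accumulate P(class | each system's label)."""
    scores = np.zeros(len(CLASSES))
    for label, conf in zip(decisions, conf_matrices):
        j = CLASSES.index(label)
        scores += class_conditional(conf)[:, j]
    return CLASSES[int(np.argmax(scores))]

# System A labels a segment "music", system B labels it "speech".
print(fuse(["music", "speech"], [conf_A, conf_B]))
```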
Speaker segmentation systems suffer from two types of error: false alarms, which consist in detecting speaker change-points that are not actual change-points; and missed detections, which consist in missing actual change-points. Audio contents are more likely to suffer from one type of error or the other depending on their nature: recordings with long speaker turns are prone to false alarms, while recordings with short speaker turns, such as dialogues, are prone to missed detections. We present different rejection strategies that aim at reducing the false alarm rate of Bayesian information criterion (BIC) based speaker segmentation systems; these rejection strategies can take into account the confidence in the candidate change-points as well as the statistical properties of the change-point process. Moreover, a strategy to reduce the missed detection rate is also presented, which is especially focused on improving segmentation performance on TV programmes, where dialogues and dynamic discourse are very common.
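For context, the sketch below shows the standard ΔBIC score behind such systems: a candidate change-point is accepted when modelling the two sides with separate full-covariance Gaussians fits better than a single Gaussian, after a complexity penalty. The confidence-based and change-point-process rejection strategies of the Thesis are not reproduced here; only a plain threshold on ΔBIC, with synthetic features, is shown as an assumption.

```python
import numpy as np

def delta_bic(X, Y, lam=1.0):
    """ΔBIC between one full-covariance Gaussian for X and Y jointly versus
    one Gaussian per side; positive values support a speaker change-point."""
    Z = np.vstack([X, Y])
    n_x, n_y, n_z = len(X), len(Y), len(Z)
    d = Z.shape[1]
    def logdet(M):
        return np.linalg.slogdet(np.cov(M, rowvar=False, bias=True))[1]
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n_z)
    return 0.5 * (n_z * logdet(Z) - n_x * logdet(X) - n_y * logdet(Y)) - penalty

# Two hypothetical "speakers" simulated as 12-dimensional Gaussian features.
rng = np.random.default_rng(0)
left = rng.normal(0.0, 1.0, size=(300, 12))
right = rng.normal(1.5, 1.0, size=(300, 12))

score = delta_bic(left, right)
threshold = 0.0  # a stricter (larger) threshold rejects weak candidates
print("candidate accepted" if score > threshold else "candidate rejected")
```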
Selecting the number of clusters is an unresolved issue in speaker diarization, as there is no fully satisfactory solution to this problem in the literature. In this Thesis, we propose a criterion to select the number of clusters which aims at maximizing the intra-cluster similarity while minimizing the extra-cluster similarity. An analysis of different speech segment representations was also performed, showing that the application of linear discriminant analysis (LDA) increases the separability of the classes, enhancing the performance of the proposed criterion for selecting the number of clusters.
Continuous emotion recognition from speech is a challenging task, especially when an emotional level must be predicted at every instant of time, as a single frame of speech does not carry enough information for such an estimation. Moreover, classification techniques cannot be applied to this task straightforwardly because there is no finite set of classes. In this Thesis, we propose two subspace projection techniques for continuous emotion recognition: the iVector paradigm is applied to this task, aiming at reducing the influence of speaker and channel variabilities, and the eigen-space approach, widely used in tasks such as speaker recognition, is also applied to continuous emotion recognition.
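Returning to the cluster-selection idea above, a minimal sketch is given below: candidate cluster counts are scored by the difference between mean intra-cluster and mean extra-cluster cosine similarity, and the count with the highest score is kept. The segment embeddings, the use of KMeans from scikit-learn and the cosine-similarity score are illustrative assumptions, not the exact criterion or representations studied in the Thesis.

```python
import numpy as np
from sklearn.cluster import KMeans

def similarity_score(X, labels):
    """Mean within-cluster similarity minus mean between-cluster similarity,
    using cosine similarity between segment representations."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(X), dtype=bool)
    return S[same & off_diag].mean() - S[~same].mean()

def select_num_clusters(X, k_range):
    """Pick the candidate cluster count with the best similarity score."""
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = similarity_score(X, labels)
    return max(scores, key=scores.get)

# Toy data: three hypothetical speakers, 40 segment embeddings each.
rng = np.random.default_rng(1)
centers = rng.normal(size=(3, 20))
X = np.vstack([rng.normal(c, 0.2, size=(40, 20)) for c in centers])
print(select_num_clusters(X, range(2, 8)))
```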
Finally, we present a study on the influence of different feature sets on the depression detection task; this is motivated by the fact that, although the literature contains many studies on similar topics, the use of different databases, features, performance measures and depression detection approaches makes it very difficult to draw firm conclusions about which features best represent depressive speech. This study is complemented with an analysis of the type of discourse for depression detection, in which either read or spontaneous speech is used to estimate the speakers' level of depression severity, and we also assess some feature-level fusion techniques for this task.
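As a small illustration of feature-level fusion in this setting, the sketch below concatenates two per-recording feature sets before training a single regressor on the fused representation. The feature sets, the synthetic severity scores and the choice of a support vector regressor are assumptions for the example only; they are not the features, data or models evaluated in the Thesis.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Hypothetical per-recording feature sets (e.g. prosodic and spectral
# statistics) and synthetic severity scores on a BDI-II-like 0-63 range.
rng = np.random.default_rng(2)
prosodic = rng.normal(size=(50, 8))
spectral = rng.normal(size=(50, 24))
severity = rng.uniform(0, 63, size=50)

# Feature-level fusion: concatenate the feature sets, then train one
# regressor on the fused representation.
fused = np.hstack([prosodic, spectral])
model = make_pipeline(StandardScaler(), SVR())
model.fit(fused[:40], severity[:40])
print(model.predict(fused[40:]).round(1))
```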