Speaker ID

Speaker recognition and speech processing

This work is conducted in collaboration with the [http://www.crim.ca/en/applied-research-centre/teams/speech-recognition Speech Recognition team of the Computer Research Institute of Montreal (CRIM - Canada)] and the [http://groups.csail.mit.edu/sls// Spoken Language Systems Group of the MIT Computer Science and Artificial Intelligence Laboratory].


SVM for Speaker Verification

Two families of approaches are used to represent a speaker: generative approaches model the speaker with a parametric model, typically a Gaussian Mixture Model (GMM), while discriminative approaches learn to distinguish one speaker from a set of impostor identities, most commonly with a Support Vector Machine (SVM).

One of the main obstacles to using SVM methods for speaker verification is the variable length of audio recordings: an SVM needs a fixed-size input, whereas each recording yields a sequence of acoustic vectors of variable length. There are two approaches for extracting a fixed-size vector from this sequence:

  • The first approach uses the acoustic vectors directly as the training set for the SVM.
  • The second applies the SVM in the space of GMM model parameters: a GMM-UBM system is used as a parameter extractor for the SVM. Our work follows this second approach (a minimal sketch of the extraction step is given after this list).
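
As an illustration of the second approach, here is a minimal sketch, assuming numpy, a diagonal-covariance UBM and hypothetical variable names, of how a fixed-size supervector can be obtained from a variable-length recording by MAP-adapting the UBM means and stacking them:

<syntaxhighlight lang="python">
import numpy as np

def map_adapt_means(ubm_means, ubm_weights, ubm_covars, frames, r=16.0):
    """MAP-adapt the UBM means to one recording (diagonal covariances).

    ubm_means   : (C, D) mixture means
    ubm_weights : (C,)   mixture weights
    ubm_covars  : (C, D) diagonal covariances
    frames      : (T, D) acoustic vectors (e.g. MFCCs) of the recording
    r           : relevance factor controlling how far means move from the UBM
    """
    # Per-frame, per-component log-likelihoods under diagonal Gaussians.
    diff = frames[:, None, :] - ubm_means[None, :, :]                   # (T, C, D)
    log_g = -0.5 * np.sum(diff ** 2 / ubm_covars
                          + np.log(2.0 * np.pi * ubm_covars), axis=2)   # (T, C)
    log_p = np.log(ubm_weights) + log_g
    # Posterior responsibility of each component for each frame.
    post = np.exp(log_p - log_p.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)
    # Zeroth- and first-order sufficient statistics.
    n = post.sum(axis=0)                                                # (C,)
    f = post.T @ frames                                                 # (C, D)
    # MAP update: interpolate between the data mean and the UBM mean.
    alpha = (n / (n + r))[:, None]
    return alpha * (f / np.maximum(n, 1e-10)[:, None]) + (1.0 - alpha) * ubm_means

def supervector(adapted_means):
    # Stack the adapted means into one fixed-size vector, whatever the recording length.
    return adapted_means.reshape(-1)
</syntaxhighlight>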

We studied the linear kernel, defined by a scalar product between two mean supervectors, and the Gaussian kernel, defined from the Kullback-Leibler (KL) distance between two GMMs. We showed the importance of normalizing the model parameters with the parameters of the UBM (M-Norm) for improving system performance, both on its own and when combined with the channel compensation technique Nuisance Attribute Projection (NAP).
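
The two kernels can be sketched as follows; this is only an illustration under assumed settings (the normalization constants, the gamma value and the variable names are not the ones used in our experiments):

<syntaxhighlight lang="python">
import numpy as np

def m_norm(adapted_means, ubm_means, ubm_weights, ubm_covars):
    # M-Norm: center the adapted means on the UBM means and scale each component
    # by the square root of its weight and the inverse UBM standard deviation.
    scale = np.sqrt(ubm_weights)[:, None] / np.sqrt(ubm_covars)
    return (scale * (adapted_means - ubm_means)).reshape(-1)

def linear_kernel(sv_a, sv_b):
    # Linear kernel: scalar product between two normalized mean supervectors.
    return float(sv_a @ sv_b)

def kl_gaussian_kernel(sv_a, sv_b, gamma=1e-3):
    # Gaussian kernel built on an approximate KL distance between the two adapted
    # GMMs; after M-Norm it reduces to a squared Euclidean distance in the
    # normalized supervector space.
    return float(np.exp(-gamma * np.sum((sv_a - sv_b) ** 2)))
</syntaxhighlight>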

One advantage of SVM methods is their scoring speed. This motivated us to explore combining them with Joint Factor Analysis (JFA), which gives a better representation of channel and speaker variability but is slow. We first compared the two approaches, then successfully integrated a kernel derived from the cosine distance in the JFA parameter space. The resulting performance is similar to that of the JFA method, but with much faster scoring.
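
A minimal sketch of such a cosine kernel, assuming the JFA speaker factors have already been extracted as vectors:

<syntaxhighlight lang="python">
import numpy as np

def cosine_kernel(y_a, y_b):
    # Kernel derived from the cosine distance between two speaker-factor vectors.
    return float(y_a @ y_b) / (np.linalg.norm(y_a) * np.linalg.norm(y_b))

def cosine_gram(Y):
    # Gram matrix over a set of speaker-factor vectors (one per row). Once the
    # vectors are length-normalized, every SVM score costs only dot products,
    # which is what keeps scoring fast.
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return Yn @ Yn.T
</syntaxhighlight>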

In parallel, we proposed a fusion method in the SVM kernel space using Multiple Kernel Learning (MKL) approaches. This approach has the advantage of not requiring an additional training set to learn the fusion parameters. The performance obtained by fusing a linear kernel, a Gaussian kernel, and a GLDS kernel is similar to that of score-level fusion, and exceeds it in some cases.
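
The fused kernel has a simple form; the sketch below shows only that form (the weight values and names are illustrative assumptions, and in MKL the weights are learned by the solver rather than fixed by hand):

<syntaxhighlight lang="python">
import numpy as np

def combined_kernel(kernels, betas):
    # MKL fuses base kernels directly in kernel space: K = sum_m beta_m * K_m,
    # with non-negative weights summing to one, learned jointly with the SVM,
    # so no extra development set is needed for the fusion.
    betas = np.asarray(betas, dtype=float)
    assert np.all(betas >= 0.0) and abs(betas.sum() - 1.0) < 1e-9
    return sum(b * K for b, K in zip(betas, kernels))

# Usage sketch: K_lin, K_gauss and K_glds would be Gram matrices computed on the
# same training data with the linear, Gaussian and GLDS kernels.
# K = combined_kernel([K_lin, K_gauss, K_glds], [0.4, 0.3, 0.3])
</syntaxhighlight>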


I-Vector: a simplified representation of the speaker's identity

In order to improve on the JFA method, we proposed a new set of parameters for representing speakers: the I-Vector (Identity Vector). This representation captures both variabilities (channel and speaker) in a single Total Variability (TV) space. We use simple pattern recognition techniques (LDA: Linear Discriminant Analysis, and WCCN: Within Class Covariance Normalization) in this low-dimensional TV space to compensate for channel variability. The score is obtained directly by computing a cosine distance between the i-vectors of the test segment and the training segment. This approach currently gives the best performance, especially when the training and test samples are of short duration.
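
A minimal sketch of the scoring side of this pipeline (the LDA projection is omitted for brevity; the ivectors, labels and trial vectors are assumed to be numpy arrays of previously extracted i-vectors):

<syntaxhighlight lang="python">
import numpy as np

def wccn_projection(ivectors, labels):
    # Within Class Covariance Normalization: average the within-speaker covariance
    # over the training speakers and whiten the space with the Cholesky factor B
    # of its inverse (B @ B.T == inv(W)).
    dim = ivectors.shape[1]
    W = np.zeros((dim, dim))
    speakers = np.unique(labels)
    for s in speakers:
        x = ivectors[labels == s]
        xc = x - x.mean(axis=0)
        W += xc.T @ xc / len(x)
    W /= len(speakers)
    return np.linalg.cholesky(np.linalg.inv(W))

def cosine_score(w_enroll, w_test, B):
    # The trial score is simply the cosine between the two projected i-vectors;
    # no per-trial model evaluation is needed.
    a, b = B.T @ w_enroll, B.T @ w_test
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
</syntaxhighlight>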

We also simplified score normalization (Z-Norm and T-Norm) by integrating it directly into the computation of the distance, which reduced the computation time.
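
One plausible way to fold the normalization into the distance itself is sketched below; this is an assumption-laden illustration, not necessarily the exact formulation we used (imp_mean and imp_whitener stand for statistics estimated once on an impostor cohort):

<syntaxhighlight lang="python">
import numpy as np

def normalized_cosine_score(w_enroll, w_test, imp_mean, imp_whitener):
    # Center and whiten both i-vectors with impostor-cohort statistics before
    # taking the cosine, so the score comes out already normalized and no
    # per-trial Z-Norm / T-Norm pass is required.
    a = imp_whitener @ (w_enroll - imp_mean)
    b = imp_whitener @ (w_test - imp_mean)
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
</syntaxhighlight>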


Evaluation campaign of the NIST-SRE speaker verification systems

We have participated in the speaker verification evaluation campaigns organized by NIST (the National Institute of Standards and Technology, which runs evaluations in many fields both to stimulate research and to define new standards) since the beginning of the project. In 2006, we proposed a system based on a simple GMM and an SVM system based on the Gaussian kernel, but without channel compensation techniques.

In 2008, we submitted a fusion of several systems: two SVM systems, one with a linear kernel and one with a non-linear kernel, were fused with three systems based on prosodic parameters that capture the long-term characteristics of the speech signal.

In 2010, the submitted system used I-Vectors to represent the speaker, together with the channel compensation techniques described in the previous section. We also took part in the 10sec-10sec task for the first time.

Our system's performance improved with each of these evaluations.