Time Delay Neural Networks-Based Universal Background Model for Speaker Recognition



In speaker recognition, deep neural networks (DNNs) have recently proved more effective than traditional Gaussian mixture models (GMMs) for collecting the Baum-Welch statistics used for i-vector extraction. However, this type of architecture can be too slow at evaluation time, requiring a GPU to achieve real-time performance. We show how triphone posteriors produced by a time delay neural network (TDNN) can be used to create a more lightweight supervised GMM serving as a universal background model (UBM) inside the i-vector framework. The equal error rate (EER) obtained with this approach is compared with that obtained with a traditional GMM-based UBM.
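
To make the idea of a supervised GMM concrete, the following is a minimal NumPy sketch, not the implementation used in this work. It assumes that frame-level acoustic features and per-frame class posteriors (standing in for the TDNN triphone posteriors) are already available as arrays, that covariances are diagonal, and that every class receives non-zero occupancy. The function names, array shapes, and variance floor are illustrative assumptions.

```python
import numpy as np

def supervised_gmm_from_posteriors(feats, post, var_floor=1e-6):
    """Estimate one diagonal Gaussian per class from soft alignments.

    feats: (T, D) acoustic features (hypothetical input).
    post:  (T, C) per-frame class posteriors, e.g. from a TDNN (hypothetical input).
    Assumes every class has non-zero total occupancy.
    """
    occ = post.sum(axis=0)                           # (C,) soft frame counts
    weights = occ / occ.sum()                        # mixture weights
    means = (post.T @ feats) / occ[:, None]          # (C, D) weighted means
    second = (post.T @ (feats ** 2)) / occ[:, None]  # (C, D) weighted second moments
    variances = np.maximum(second - means ** 2, var_floor)
    return weights, means, variances

def gmm_posteriors(feats, weights, means, variances):
    """Frame posteriors under the diagonal GMM; no neural network is needed here."""
    diff = feats[:, None, :] - means[None, :, :]     # (T, C, D)
    log_lik = -0.5 * (
        np.sum(diff ** 2 / variances[None, :, :], axis=2)
        + np.sum(np.log(2.0 * np.pi * variances), axis=1)[None, :]
    )
    log_post = np.log(weights)[None, :] + log_lik
    log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(log_post)
    return p / p.sum(axis=1, keepdims=True)

def baum_welch_stats(feats, gamma):
    """Zeroth- and first-order statistics of the kind used for i-vector extraction."""
    N = gamma.sum(axis=0)   # (C,) zeroth-order statistics
    F = gamma.T @ feats     # (C, D) first-order statistics
    return N, F
```

In this sketch the network is only needed once, to supply `post` when the GMM is estimated; afterwards `gmm_posteriors` and `baum_welch_stats` operate on features alone, which is the sense in which the resulting UBM is more lightweight at evaluation time.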