Аннотация
For many years, i-vector based audio embedding techniques were the dominant
approach for speaker verification and speaker diarization applications.
However, mirroring the rise of deep learning in various domains, neural network
based audio embeddings, also known as d-vectors, have consistently demonstrated
superior speaker verification performance. In this paper, we build on the
success of d-vector based speaker verification systems to develop a new
d-vector based approach to speaker diarization. Specifically, we combine
LSTM-based d-vector audio embeddings with recent work in non-parametric
clustering to obtain a state-of-the-art speaker diarization system. Our system
is evaluated on three standard public datasets, suggesting that d-vector based
diarization systems offer significant advantages over traditional i-vector
based systems. We achieved a 12.0% diarization error rate on NIST SRE 2000
CALLHOME, while our model is trained with out-of-domain data from voice search
logs.
Пользователи данного ресурса
Пожалуйста,
войдите в систему, чтобы принять участие в дискуссии (добавить собственные рецензию, или комментарий)