Optimizing the Dataset for the Privacy Evaluation of Speaker Anonymizers
This study investigates how the configuration of the LibriSpeech data used to evaluate anonymizers affects both the reliability of the results and the evaluation's runtime.
Links
- GitHub repository - the results can be found as a release
- Paper
Insights
The evaluation’s reliability is determined by the evaluation dataset. Our results suggest that 20 enrollment utterances per speaker are enough to characterize them; using more does not strengthen the attack further. Contrary to prior experiments, our results suggest that adding more speakers to the evaluation does not make it more challenging: the similarity among speakers matters more than their number. Experiments with other datasets should shed more light on this topic.
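Why extra enrollment utterances stop helping can be illustrated with a toy sketch: the attacker's enrollment model is typically an average of utterance embeddings, and that average stabilizes quickly. The data, dimensions, and noise level below are illustrative assumptions, not the paper's setup:

```python
import numpy as np

def enrollment_model(embeddings: np.ndarray, n_utts: int, seed: int = 0) -> np.ndarray:
    """Average n_utts randomly chosen utterance embeddings into a
    single enrollment model for one speaker."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(embeddings), size=n_utts, replace=False)
    return embeddings[idx].mean(axis=0)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy data: 40 utterance embeddings scattered around a speaker "voice print".
rng = np.random.default_rng(1)
voice_print = rng.normal(size=192)
utts = voice_print + 0.5 * rng.normal(size=(40, 192))

# The averaged model converges toward the voice print as n_utts grows,
# so the marginal gain of each extra utterance shrinks.
for n in (5, 20, 40):
    print(n, round(cosine(enrollment_model(utts, n), voice_print), 3))
```

On this toy data the similarity to the underlying voice print saturates well before all 40 utterances are used, mirroring the diminishing returns observed with real enrollment data.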
Another interesting outcome of this study is the low EER achieved against speech-to-text-to-speech (STT-TTS). In principle, this anonymizer is perfect, as all speaker information is lost in the transcription. Yet its EERs are much lower than we expected (~35% rather than the 50% chance level), implying that speakers can be identified through their transcribed speech.
[EDIT] I investigate this further in this paper
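Since these conclusions rest on EER numbers, here is a minimal sketch of how EER is computed from an attacker's trial scores. The threshold sweep below is a simple illustrative implementation, not the paper's evaluation code:

```python
import numpy as np

def eer(target: np.ndarray, nontarget: np.ndarray) -> float:
    """Equal error rate: sweep every observed score as a threshold and
    return the operating point where false acceptance ~= false rejection."""
    thresholds = np.unique(np.concatenate([target, nontarget]))
    best_gap, best_eer = float("inf"), 0.5
    for t in thresholds:
        frr = np.mean(target < t)       # genuine trials rejected
        far = np.mean(nontarget >= t)   # impostor trials accepted
        if abs(far - frr) < best_gap:
            best_gap, best_eer = abs(far - frr), (far + frr) / 2
    return float(best_eer)

rng = np.random.default_rng(0)
# Well-separated score distributions -> strong attack, low EER (privacy leak).
print(eer(rng.normal(2.0, 0.5, 1000), rng.normal(0.0, 0.5, 1000)))
# Fully overlapping distributions -> chance level, EER near 0.5 (good anonymization).
print(eer(rng.normal(0.0, 0.5, 1000), rng.normal(0.0, 0.5, 1000)))
```

In this framing, an EER of ~35% for STT-TTS means the attacker's target and non-target scores still separate noticeably, even though the audio was fully re-synthesized from text.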
Regarding the evaluation runtime, we look at different strategies for reducing the size of the training data. Anonymizing the training data and training the speaker recognizer on it takes up most of the evaluation runtime (usually more than 90%), so a more efficient training regime for the privacy evaluation would be extremely helpful. According to our results, 20% of the training data can be removed with a maximum EER increase of 1.4%. Even more can be removed for development purposes, as the differences between anonymizers remain. We experimented with different selection strategies based on the original speaker embeddings, but anonymization does not preserve the relationships between utterances. As a result, different anonymizers benefit from different strategies, which makes the evaluation somewhat unfair.
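To make "selection strategies based on the original speaker embeddings" concrete, here is one hypothetical strategy of the kind tested: drop the utterances closest to each speaker's embedding centroid, assuming they are the most redundant. This is an illustrative sketch, not the strategy the paper recommends (no single strategy worked for all anonymizers):

```python
import numpy as np

def prune_redundant(embeddings: np.ndarray, keep_frac: float = 0.8) -> np.ndarray:
    """One possible selection strategy: drop the utterances whose original
    embeddings sit closest to the speaker centroid, on the assumption that
    they add the least new information. Returns the indices to keep."""
    centroid = embeddings.mean(axis=0)
    dists = np.linalg.norm(embeddings - centroid, axis=1)
    n_keep = max(1, int(round(keep_frac * len(embeddings))))
    return np.argsort(dists)[-n_keep:]  # keep the most distant utterances

# Usage: keep 80% of a speaker's 10 utterances (toy 2-D embeddings).
emb = np.arange(20, dtype=float).reshape(10, 2)
print(sorted(prune_redundant(emb, keep_frac=0.8)))
```

The catch reported above is that such rankings are computed on the *original* embeddings, and anonymization reshuffles which utterances are actually redundant, so each anonymizer ends up favoring a different pruning rule.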
I think that an iterative training regime such as the following could yield better performance:
- Anonymize a subset of the training data.
- Train the speaker recognizer.
- Anonymize more data for the speakers whose validation accuracy was lowest.
- Go back to step 2.
This would address how anonymizers alter the data distribution, as each anonymizer would have its own training data.
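The proposed loop can be sketched as follows. The helpers `anonymize` and `train_and_validate` are hypothetical placeholders for the (expensive) anonymization and recognizer-training steps, and the budget fractions are illustrative:

```python
def iterative_training(speakers, anonymize, train_and_validate,
                       init_frac=0.2, step_frac=0.1, rounds=3):
    """Sketch of the proposed regime: anonymize a subset, train the
    recognizer, then spend the anonymization budget on the speakers
    the recognizer handles worst.

    speakers: dict mapping speaker id -> total number of utterances.
    anonymize(s, k): hypothetical helper anonymizing k more utterances of s.
    train_and_validate(state): hypothetical helper returning per-speaker
    validation accuracy after training on the anonymized data so far.
    """
    # Step 1: anonymize an initial subset for every speaker.
    anonymized = {s: int(init_frac * n) for s, n in speakers.items()}
    for s, k in anonymized.items():
        anonymize(s, k)
    for _ in range(rounds):
        # Step 2: train the speaker recognizer on the data anonymized so far.
        accuracy = train_and_validate(anonymized)
        # Step 3: anonymize more data for the lowest-accuracy speakers.
        worst = sorted(accuracy, key=accuracy.get)[: max(1, len(accuracy) // 4)]
        for s in worst:
            extra = int(step_frac * speakers[s])
            new_total = min(speakers[s], anonymized[s] + extra)
            anonymize(s, new_total - anonymized[s])
            anonymized[s] = new_total
        # Step 4: loop back to training.
    return anonymized

# Usage with stub helpers: speaker "a" is consistently hardest, so the
# loop allocates its extra anonymization budget to "a".
speakers = {"a": 100, "b": 100, "c": 100, "d": 100}
fixed_acc = {"a": 0.1, "b": 0.9, "c": 0.9, "d": 0.9}
result = iterative_training(speakers, lambda s, k: None, lambda st: fixed_acc)
print(result)
```

Because each anonymizer would drive its own data-selection trajectory, this sidesteps the unfairness of fixing one embedding-based selection strategy for all systems.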