RAVAS stands for Real-time Audio-Visual Anonymization System. It runs locally and can be used for video calls. Pre-recorded videos can also be anonymized to run experiments that simulate the real-time anonymization scenario, as we did in the paper referenced below.

Insights

Audio and video are anonymized by independent modules, whose outputs are synchronized before being sent to their respective virtual devices. These devices can then be selected as the microphone and camera in teleconferencing apps like Teams and Zoom. RAVAS uses a multi-threaded architecture to process the input streams efficiently, roughly as in the sketch below.
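The following is a minimal sketch of this architecture, not RAVAS's actual code: two worker threads anonymize their stream independently, and a third pairs their outputs by capture index before anything would be written to the virtual devices. The functions anonymize_video_frame and anonymize_audio_chunk are hypothetical placeholders for the real modules.

```python
import itertools
import queue
import threading
import time

# Hypothetical stand-ins for the real anonymization modules.
def anonymize_video_frame(frame):
    return f"anon({frame})"

def anonymize_audio_chunk(chunk):
    return f"anon({chunk})"

video_q, audio_q = queue.Queue(), queue.Queue()

def video_worker():
    for ts in itertools.count():
        frame = f"frame{ts}"          # capture from the real camera here
        video_q.put((ts, anonymize_video_frame(frame)))
        time.sleep(1 / 30)            # ~30 fps

def audio_worker():
    for ts in itertools.count():
        chunk = f"chunk{ts}"          # capture from the real microphone here
        audio_q.put((ts, anonymize_audio_chunk(chunk)))
        time.sleep(1 / 30)

def sync_worker():
    # Pair items by capture index, so whichever module is faster
    # waits for the other before the output reaches the virtual devices.
    while True:
        ts_v, frame = video_q.get()
        ts_a, chunk = audio_q.get()
        print(f"t={ts_v}: {frame} -> virtual camera, {chunk} -> virtual mic")

for worker in (video_worker, audio_worker, sync_worker):
    threading.Thread(target=worker, daemon=True).start()
time.sleep(0.2)  # let the pipeline run briefly
```

Pairing by capture index in a dedicated thread is what keeps audio and video aligned at the output even when one module momentarily runs slower than the other.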

Video anonymization

For anonymizing the video, we use MediaPipe's face mesh to model the person's face. The mesh is then passed to a separate rendering service, which applies it to an avatar from ReadyPlayerMe. A screenshot of the rendered avatar is returned to RAVAS, which outputs it to the virtual camera after synchronizing it with the audio if needed.
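The tracking half of this pipeline can be sketched in a few lines. This is an illustration under assumptions, not RAVAS's actual code: it uses MediaPipe's face-mesh solution and the pyvirtualcam library (one common way to create a virtual camera device), and render_avatar is a hypothetical placeholder for the external rendering service.

```python
import cv2
import mediapipe as mp
import numpy as np
import pyvirtualcam

def render_avatar(landmarks):
    # Hypothetical stand-in for the external rendering service that
    # drives the ReadyPlayerMe avatar and returns a screenshot.
    return np.zeros((480, 640, 3), dtype=np.uint8)

face_mesh = mp.solutions.face_mesh.FaceMesh(
    static_image_mode=False,   # video mode: track the mesh across frames
    refine_landmarks=True,     # extra landmarks around the eyes and lips
)

cap = cv2.VideoCapture(0)
with pyvirtualcam.Camera(width=640, height=480, fps=30) as cam:
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB input; OpenCV captures BGR.
        results = face_mesh.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.multi_face_landmarks:
            avatar = render_avatar(results.multi_face_landmarks[0])
            cam.send(avatar)   # appears as a webcam in Teams/Zoom
            cam.sleep_until_next_frame()
```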

For the study discussed in the paper, we used the face mesh directly as the anonymized video, but this does not anonymize the shape of the face: participants were often able to identify the anonymized persons through the shape and orientation of their faces in the video. The avatar improves privacy because it has its own shape and is merely animated by the person's face mesh. We are currently designing an objective evaluation for video privacy that tries to identify anonymized persons from the avatar's movement alone. Preliminary results show that this is indeed a threat; I will report more on this soon.

Voice anonymization

For anonymizing the voice, we initially used kNN-VC, a voice converter based on WavLM. Although its output quality is great, it is not designed for real-time scenarios. We worked around this by splitting the audio stream into 0.5-second chunks and interpolating across chunk borders to mitigate the audible cuts in the anonymized audio. Although the quality did not degrade much, the latency was poor (0.7 seconds on my MacBook Pro M1). Recently, I found that the same voice conversion approach can be applied to Mimi, Kyutai's real-time speech tokenizer. This works because Mimi's internal representation is distilled from WavLM, and thus inherits the properties that make WavLM suitable for voice conversion. Its chunk size is 0.08 seconds and, most importantly, it does not require future context, making it perfect for our real-time use case. Together with the avatar, RAVAS runs with a latency of 0.13 seconds on my MacBook Pro M1. Both voice anonymizers are available with several target speakers and are exported to ONNX for faster inference.
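To make the chunked processing concrete, one way such interpolation can be realized is to convert slightly overlapping chunks and crossfade the overlap, which smooths over the discontinuity at each chunk border. This is a minimal sketch under assumptions, not necessarily the exact scheme used in RAVAS; convert_chunk is a hypothetical stand-in for the ONNX voice-conversion model.

```python
import numpy as np

SR = 16_000
CHUNK = int(0.5 * SR)     # 0.5-second chunks, as in our kNN-VC setup
OVERLAP = int(0.02 * SR)  # small overlap used to crossfade chunk borders

def convert_chunk(chunk):
    # Hypothetical stand-in for the ONNX voice-conversion model.
    return chunk

def stream_convert(audio):
    out = np.zeros(0, dtype=np.float32)
    fade_in = np.linspace(0.0, 1.0, OVERLAP, dtype=np.float32)
    fade_out = 1.0 - fade_in
    pos = 0
    while pos < len(audio):
        chunk = convert_chunk(audio[pos:pos + CHUNK + OVERLAP])
        if len(out) >= OVERLAP and len(chunk) >= OVERLAP:
            # Crossfade the new chunk into the tail of the previous one
            # to hide the discontinuity at the cut.
            out[-OVERLAP:] = out[-OVERLAP:] * fade_out + chunk[:OVERLAP] * fade_in
            out = np.concatenate([out, chunk[OVERLAP:]])
        else:
            out = np.concatenate([out, chunk])
        pos += CHUNK
    return out

anonymized = stream_convert(np.random.randn(SR * 3).astype(np.float32))
```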

Note that the voice anonymization is not very strong. As I showed with Private kNN-VC, speakers anonymized with kNN-VC can be identified through the phonetic variation and durations of their speech, since these are not anonymized. The same applies to Mimi-VC. In the near future, we will improve the privacy of Mimi-VC by restricting phone variation, following the approach proposed by Private kNN-VC. Anonymizing the durations is not straightforward, as altering them would break the synchronization with the video.