E2E-VGuard:

Adversarial Prevention for Industrial and LLM-based End-To-End Speech Synthesis

[GitHub]

Zhisheng Zhang, Derui Wang, Yifan Mi, Zhiyong Wu, Jie Gao, Yuxin Cao, Kai Ye, Minhui Xue, Jie Hao

Abstract: Recent advancements in speech synthesis technology have enriched our daily lives, with high-quality, human-like audio widely adopted across real-world applications. However, malicious exploitation such as voice-cloning fraud poses severe security risks. Existing defense techniques struggle to defend against production large language model (LLM)-based speech synthesis. While previous studies have considered protection against the fine-tuning of synthesizers, they assume manually annotated transcripts. Given the labor intensity of manual annotation, end-to-end (E2E) systems that use automatic speech recognition (ASR) to generate transcripts are becoming increasingly prevalent, e.g., voice cloning via commercial APIs. E2E speech synthesis therefore also requires new security mechanisms. To tackle these challenges, we propose E2E-VGuard, a proactive defense framework for two emerging threats: (1) production LLM-based speech synthesis, and (2) the novel attack arising from ASR-driven E2E scenarios. Specifically, we employ an encoder ensemble with a feature extractor to protect timbre, while ASR-targeted adversarial examples disrupt pronunciation. Moreover, we incorporate a psychoacoustic model to keep the perturbation imperceptible. For a comprehensive evaluation, we test 16 open-source synthesizers and 3 commercial APIs across Chinese and English datasets, confirming E2E-VGuard's effectiveness in timbre and pronunciation protection. We also validate the defense in a real-world deployment. Our code and demo page are available at https://wxzyd123.github.io/e2e-vguard/.
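
The timbre-protection idea in the abstract (perturbing speech so that an ensemble of encoders no longer maps it to the original speaker embedding) can be illustrated with a toy sketch. This is not the E2E-VGuard implementation: the random linear "encoders", the function names `protect` and `ensemble_distance`, and the L-infinity budget `eps` are all assumptions made for illustration. The budget is only a crude stand-in for the psychoacoustic constraint, and the ASR-targeted term is omitted entirely.

```python
import numpy as np

def protect(x, encoders, eps=0.01, step=0.002, iters=50, seed=0):
    """Toy sign-gradient ascent: push ensemble embeddings of x + delta away from those of x."""
    rng = np.random.default_rng(seed)
    # Random start, because the gradient of the distance is zero at delta = 0.
    delta = rng.uniform(-eps, eps, size=x.shape)
    for _ in range(iters):
        # Gradient of mean_k ||W_k (x + delta) - W_k x||^2 with respect to delta.
        grad = sum(2.0 * W.T @ (W @ delta) for W in encoders) / len(encoders)
        # Ascent step, then projection back onto the L-infinity ball of radius eps.
        delta = np.clip(delta + step * np.sign(grad), -eps, eps)
    return x + delta

def ensemble_distance(x, x_adv, encoders):
    """Mean squared embedding distance between x and x_adv over the ensemble."""
    return float(np.mean([np.sum((W @ x_adv - W @ x) ** 2) for W in encoders]))

rng = np.random.default_rng(1)
x = rng.standard_normal(256) * 0.1                             # stand-in for a waveform frame
encoders = [rng.standard_normal((16, 256)) for _ in range(3)]  # hypothetical linear encoders
x_protected = protect(x, encoders)
```

In a real system the linear maps would be replaced by pretrained speaker encoders and the gradient would come from automatic differentiation; the sketch only shows the shape of the optimization loop under a bounded perturbation.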

E2E-VGuard Workflow

Figure 1. The workflow of E2E-VGuard and the end-to-end fine-tuning pipeline.

Original and Protected Speech

The samples below are the original and protected speech of speaker 5339 from the LibriTTS dataset.

Method Sample 1 Sample 2 Sample 3
Text
Sample 1: Suddenly, a light shown down into Philip's mind.
Sample 2: Her father was right, insane that she might not have set up for yesterday arrow.
Sample 3: He made for the door.
Original
Protected AttackVC
AntiFake
POP
POP+ESP
SafeSpeech
E2E-VGuard (UT)
E2E-VGuard (T)

Synthesized Speech based on Fine-tuning

GPT-SoVITS

Method Sample 1 Sample 2 Sample 3
Text
Sample 1: She had talked about it to Kinraed and her father in order to cover her regret at her lovers accompanying her father to see some new kind of harpoon about which the latter had spoken.
Sample 2: One of these seemed of a special consequence to the Good Brothers. They each separately looked at the direction, and then at one another. And without a word, they returned with it unread into the parlor, shutting the door and drawing the green silk curtain close, the better to read it in privacy.
Sample 3: Her life was quiet and monotonous, although hardworking. And while her hands mechanically found and did their accustomed labor, the thoughts that rose in her head always centered on Charlie Kinraid. His ways, his words, his looks, whether they all meant what she would fain believe they did, and whether meaning love at the time, such a feeling was likely to endure.
Ground Truth
clean
AttackVC
AntiFake
POP
POP+ESP
SafeSpeech
E2E-VGuard (UT)
E2E-VGuard (T)

Other Models after Fine-tuning

Models Method Sample 1 Sample 2 Sample 3
CosyVoice clean
E2E-VGuard (UT)
E2E-VGuard (T)
Llasa-8B clean
E2E-VGuard (UT)
E2E-VGuard (T)
StyleTTS2 clean
E2E-VGuard (UT)
E2E-VGuard (T)
VITS clean
E2E-VGuard (UT)
E2E-VGuard (T)

Multi-Speaker and Multilingual Evaluation

Multi-speaker test on the CMU ARCTIC dataset.

Method Speaker 1 Speaker 2 Speaker 3
Text The wolf-dog thrust his gaunt muzzle toward him. Gregson was asleep when he re-entered the cabin. The very opposite is true. They are discouraged vagabonds.
Ground Truth
clean
E2E-VGuard

Chinese dataset test.

Method Speaker 1 Speaker 1 Speaker 2 Speaker 2
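Text
Speaker 1, first sample: 明月村富了,昔日的穷山沟冒出了这样一轮明月,令周围的群众好生羡慕。 (English: Mingyue Village has grown prosperous; such a bright moon rising from what was once a poor mountain gully has made the people nearby quite envious.)
Speaker 1, second sample: 反正咱高低都不说啥,做人不能忘本,没有老马,哪有云霞的今天,我啥也不说了。 (English: Anyway, I won't say anything either way; a person must not forget their roots. Without Lao Ma, how would Yunxia be where she is today? I'll say no more.)
Speaker 2, first sample: 昙花一现的王叔文改革,由于发生在唐顺宗永贞年间,所以又被称为永贞革新。 (English: The short-lived Wang Shuwen reform took place during the Yongzhen era of Emperor Shunzong of Tang, so it is also called the Yongzhen Reform.)
Speaker 2, second sample: 安第斯集团由玻利维亚、哥伦比亚、厄瓜多尔、秘鲁和委内瑞拉五国组成。 (English: The Andean Group consists of five countries: Bolivia, Colombia, Ecuador, Peru, and Venezuela.)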
Ground Truth
clean
E2E-VGuard

Synthesized Speech based on Zero-shot

The samples below were synthesized by advanced zero-shot TTS models.

Models Method Sample 1 Sample 2 Sample 3
Step-Audio-TTS clean
E2E-VGuard (UT)
E2E-VGuard (T)
Spark-TTS clean
E2E-VGuard (UT)
E2E-VGuard (T)
Dia-1.6B clean
E2E-VGuard (UT)
E2E-VGuard (T)

Synthesized Speech based on Commercial APIs

The samples below were synthesized by commercial speech-synthesis products.

Models Method Sample 1 Sample 2 Sample 3
ByteDance clean
E2E-VGuard (UT)
Alibaba clean
E2E-VGuard (UT)
MiniMax clean
E2E-VGuard (UT)