Anonymous Speech Editing Demo

Compare multiple speech editing systems on deletion / insertion / substitution subsets.

This page is an anonymous demo intended for peer review. Audio is served as static files, so please allow a moment for the first playback to load.

Imperceptible speech editing demands that modified segments fuse seamlessly with the surrounding context. Prevalent methods that operate in the acoustic token space suffer from inherent content-style entanglement, which leads to generation instability and boundary artifacts. In this paper, we propose a novel framework grounded in the principle of "Edit Content, Preserve Acoustics". Our approach relies on two core components: (1) Structural Foundations, which decouples editing into a stable semantic space while delegating acoustic reconstruction to a Flow Matching decoder; and (2) Perceptual Alignment, which employs a novel Self-Consistency Rewards Group Relative Policy Optimization. By leveraging a pre-trained TTS model as an implicit critic, complemented by strict intelligibility and duration constraints, we effectively align the edited semantic tokens with the original context. Empirical evaluations demonstrate that our method significantly outperforms state-of-the-art autoregressive and non-autoregressive baselines, achieving superior intelligibility, robustness, and perceptual quality.
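As a rough illustration of the group-relative step that Group Relative Policy Optimization is generally built on, the sketch below normalizes the rewards of several candidate edits sampled for the same utterance against their own group statistics. It is not the authors' implementation: the function name `group_relative_advantages`, the `eps` parameter, and the example reward values are hypothetical, and how the self-consistency, intelligibility, and duration terms are combined into a single scalar reward is not specified here and is simply assumed to have happened upstream.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group (GRPO-style).

    `rewards` holds one scalar per candidate edit sampled for the same input.
    How each scalar is computed (e.g. from a TTS-based critic plus
    intelligibility and duration terms) is an assumption left to the caller.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps avoids division by zero when every candidate gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four candidate edits of one utterance.
example_rewards = [0.2, 0.5, 0.9, 0.4]
example_advantages = group_relative_advantages(example_rewards)
```

Candidates that score above the group mean receive positive advantages and are reinforced, while those below the mean are discouraged; no separate learned value model is needed for this normalization.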

Audio Demo
