Paper Review - NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models

Review of the Paper: NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models
Paper Link: arXiv:2403.03100

This paper presents a new approach to zero-shot speech synthesis built around FACodec, a codec model that factorizes speech into separate subspaces for content, prosody, timbre, and acoustic details. FACodec’s architecture particularly captured my attention because it addresses two key challenges: speech tokenization and effective disentanglement of speech representations.

I had the opportunity to present this paper at our CCDS Lab at IUB, which led to a thought-provoking discussion with lab members and my supervisors about the paper’s methodology and its implications for speech research. I am sharing the presentation slides below for those interested.

Presentation Slides: Link to slides