oai:arXiv.org:2406.19135
Computer Science
2024
7/3/2024
Expressive Text-to-Speech (TTS) using reference speech has been studied extensively to synthesize natural speech, but challenges remain in obtaining well-represented styles and in improving the generalization ability of models.
In this study, we present Diffusion-based EXpressive TTS (DEX-TTS), an acoustic model designed for reference-based speech synthesis with enhanced style representations.
Based on a general diffusion TTS framework, DEX-TTS includes encoders and adapters to handle styles extracted from reference speech.
Key innovations include the differentiation of styles into time-invariant and time-variant categories for effective style extraction, as well as the design of encoders and adapters with high generalization ability.
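To make the time-invariant versus time-variant split concrete, the sketch below shows one plausible way such a separation could look in PyTorch. The module name `StyleSplitEncoder`, the layer sizes, and the mean-pooling choice are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class StyleSplitEncoder(nn.Module):
    """Hypothetical sketch: extract a time-invariant style vector
    (global, via temporal pooling) and a time-variant style sequence
    (frame-level) from a reference mel-spectrogram."""
    def __init__(self, n_mels=80, d_style=256):
        super().__init__()
        # Shared frame-wise feature extractor (assumed architecture)
        self.frame_net = nn.Sequential(
            nn.Conv1d(n_mels, d_style, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(d_style, d_style, kernel_size=3, padding=1),
        )

    def forward(self, mel):                # mel: (B, n_mels, T)
        h = self.frame_net(mel)            # (B, d_style, T)
        time_invariant = h.mean(dim=-1)    # (B, d_style): pooled global style
        time_variant = h.transpose(1, 2)   # (B, T, d_style): frame-level style
        return time_invariant, time_variant
```

In a reference-based acoustic model, the pooled vector would condition the network globally (e.g. speaker identity), while the frame-level sequence could be attended over to capture prosody that changes over time.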
In addition, we introduce overlapping patchify and convolution-frequency patch embedding strategies to improve DiT-based diffusion networks for TTS.
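The patchify change can be pictured as replacing the standard non-overlapping DiT patch projection (stride equal to the patch size) with a strided convolution whose stride is smaller than its kernel, so adjacent patches share spectrogram frames. The sketch below illustrates this under assumed names and shapes (`OverlappingPatchEmbed`, `patch`, `stride`); it is not the authors' code.

```python
import torch
import torch.nn as nn

class OverlappingPatchEmbed(nn.Module):
    """Hypothetical sketch of overlapping patchify for a DiT-style
    network over a mel-spectrogram treated as a 2-D (freq, time) image."""
    def __init__(self, in_ch=1, d_model=384, patch=4, stride=2):
        super().__init__()
        # stride < patch -> overlapping patches over the (freq, time) plane
        self.proj = nn.Conv2d(in_ch, d_model, kernel_size=patch,
                              stride=stride, padding=patch // 2)

    def forward(self, x):                    # x: (B, 1, n_mels, T)
        h = self.proj(x)                     # (B, d_model, F', T')
        return h.flatten(2).transpose(1, 2)  # (B, num_patches, d_model) tokens
```

A convolution-frequency patch embedding in this spirit would keep the convolutional projection along the frequency axis so that local spectral structure is encoded before tokens enter the transformer.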
DEX-TTS achieves outstanding objective and subjective evaluation results on English multi-speaker and emotional multi-speaker datasets, without relying on pre-training strategies.
Lastly, comparison results for general TTS on a single-speaker dataset verify the effectiveness of our enhanced diffusion backbone.
Demos are available here.
Comment: Preprint
Park, Hyun Joon; Kim, Jin Sob; Shin, Wooseok; Han, Sung Won, 2024, DEX-TTS: Diffusion-based EXpressive Text-to-Speech with Style Modeling on Time Variability