文章目录

JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to Speech
- method
- 单阶段训练模型的原理
- 对齐原理
Enhancement of Pitch Controllability using Timbre-Preserving Pitch Augmentation in FastPitch
- method
- pitch shift by VocGAN
- model architecture
- experiment
TriniTTS: Pitch-controllable End-to-end TTS without External Aligner
- introduction
- method
- - alignment search
- experiment
- - 单人模型-LJspeech
  - 多人模型-VCTK
One TTS Alignment To Rule Them All
- abstract
- method
- experiment
- ref-【RAD-TTS】

JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to Speech

作者：Dan Lim
单位：Kakao
kenlee写的github实现

method

在这里插入图片描述

fatsspeech2 + HiFiGan的联合训练实现的单阶段text2wav
decoder没有选用mel作为中间态
duration的预测，联合训练的模块，参考了One TTS Alignment To Rule Them All。
ps/es在扩帧的时候，没有采用原始的简单的repeat，选择的是gaussian upsampling with fixed temperature。

单阶段训练模型的原理

在这里插入图片描述

对齐原理

Enhancement of Pitch Controllability using Timbre-Preserving Pitch Augmentation in FastPitch

韩国NCSOFT 公司

method

motivation：FastPitch对基频的控制能力有限，当预测的基频和说话人基频均值相差较大时，生成语音质量会明显下降。
主要的创新在pitch数据增广的方法。

pitch shift by VocGAN

在这里插入图片描述

数据增广的方式增加基频的范围：（注意调整的原则：修改后的语音听感上和说话人音色一致。）
（1）参数的方式，WORLD提取，修改pitch再合成；这样会使得合成语音质量下降。（2）非参数的方式，TD-PSOLA，直接修改语音中的基频。但是很容易造成音色失真。
本文使用新的方法进行数据增广。使用VocGAN进行基频修改，相比于参数化的方法，不会产生不精确的数据问题；相比于TD-PSOLA，可以修改的基频范围更广泛。
对应的FastPitch的训练方法也针对增广数据进行修改：真实数据和增广的数据迭代训练，当使用增广数据训练的时候，duration & picth predictor的参数不更新。（保证对于相同的文本，只会产生一种预测数据）
VocGAN的生成包含多个分辨率的阶段，低分辨率的生成基频相关的信息，高分辨率的生成谱包络。输入真实mel，低频信息修改* $\alpha$ ，高频不变。得到音色不变，基频修改后的语音。

model architecture

在这里插入图片描述

experiment

在这里插入图片描述

TriniTTS: Pitch-controllable End-to-end TTS without External Aligner

AIRS Company

introduction

当前TTS的三个方向：（1）end2end的结构；（2）prosody control；（3）aligner without extra model;
motivation：使用TriniTTS一次性解决以上三个问题，并且不使用flow对齐，在CPU速度比VITS快28.84%，合成质量相当。（1）基频确定&可控；（2）end2end的方式；（3）不需要额外对齐。
两阶段的TTS：要么因为acoustic model和vocoder特征不匹配造成性能下降；要么使用acoustic model的输出训练vocoder，这种方法的性能严重依赖acoustic model的性能。
end2end-TTS：VITS，EATS，Wave-Tacotron。这些方法使用了mel spec提取特征，有可能给模型过多的真实mel信息参考。而且，比如VITS，从VAE 的latent representation采样生成语音，但是由于采样存在随机性，会导致韵律和基频不可控。

method

在这里插入图片描述

alignment search

本文（TriniTTS）使用动态规划&attention 算法进行对齐估计。
Query ( $h_{text}$ )和Key （ $h_{spec}$ ），使用soft alignment map进行映射。（参考"One TTS alignment to rule them all") 根据单向对齐原则，找出所有可能的对齐路径，用CTC Loss计算。

duration predictor：根据 $h_{text}$ 和对齐结果的时长训练。

experiment

单人模型-LJspeech

和fastspeech对比基频控制能力以及修改基频以后合成语音自然度。VITS使用开源的模型。

个人问题：和fastspeech控制基频的方法没有本质区别，为什么能证明结果更好？？

多人模型-VCTK

在这里插入图片描述

One TTS Alignment To Rule Them All

nvidia
2022 ICASSP

abstract

提出一种对齐的方法，可以广泛应用于自回归和非自回归的对齐学习。
The framework combines forward-sum algorithm, the Viterbi algorithm, and a simple and efficient static prior.
参考了RAD-TTS

method

在这里插入图片描述

experiment

在这里插入图片描述

ref-【RAD-TTS】

nvidia官方实现

2022 interspeech TTS

文章目录

JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to Speech

method

单阶段训练模型的原理

对齐原理

Enhancement of Pitch Controllability using Timbre-Preserving Pitch Augmentation in FastPitch

method

pitch shift by VocGAN

model architecture

experiment

TriniTTS: Pitch-controllable End-to-end TTS without External Aligner

introduction

method

alignment search

experiment

单人模型-LJspeech

多人模型-VCTK

One TTS Alignment To Rule Them All

abstract

method

experiment

ref-【RAD-TTS】

相关文章

语音合成(TTS)应用方案一二三

TTS 文字转语音研究，效果原来这么好。

TTS技术简单介绍和Ekho（余音）TTS的安装与编程

ChatAudio 通过TTS + STT + GPT 实现语音对话（低仿微信聊天）

.net实现简单语音朗读(TTS)功能

UE5文本转语音TTS插件

android原生TTS+语音引擎实现纯离线免费的中英TTS

利用正则表达式改善TTS听书效果