WaveFit: An Iterative and Non-Autoregressive Neural Vocoder Based On Fixed-Point Iteration
ABSTRACT

This paper proposes WaveFit, which integrates the essence of GANs into a DDPM-like iterative framework based on fixed-point iteration. WaveFit iteratively denoises an input signal, and trains a deep neural network (DNN) for minimizing an adversarial loss calculated from intermediate outputs at all iterations. Subjective (side-by-side) listening tests showed no statistically significant differences in naturalness between human natural speech and those synthesized by WaveFit with five iterations. Furthermore, the inference speed of WaveFit was more than 240 times faster than WaveRNN. Audio demos are available at google.github.io/df-conformer/wavefit/.

Index Terms— Neural vocoder, fixed-point iteration, generative adversarial networks, denoising diffusion probabilistic models.

Fig. 1. Overview of (a) DDPM, (b) GAN-based model, and (c) proposed WaveFit. (a) DDPM is an iterative-style model, where sampling from the posterior is realized by adding noise to the denoised intermediate signals. (b) GAN-based models predict y_0 by a non-iterative DNN F which is trained to minimize an adversarial loss calculated from y_0 and the target speech x_0. (c) The proposed WaveFit is an iterative-style model without adding noise at each iteration, and F is trained to minimize an adversarial loss calculated from all intermediate signals y_{T-1}, ..., y_0, where Id and G denote the identity operator and a gain adjustment operator, respectively.

1. INTRODUCTION
Neural vocoders [1–4] are artificial neural networks that generate a speech waveform given acoustic features. They are indispensable building blocks of recent applications of speech generation. For example, they are used as the backbone module in text-to-speech (TTS) [5–10], voice conversion [11, 12], speech-to-speech translation (S2ST) [13–15], speech enhancement (SE) [16–19], speech restoration [20, 21], and speech coding [22–25]. Autoregressive (AR) models first revolutionized the quality of speech generation [1, 26–28]. However, as they require a large number of sequential operations for generation, parallelizing the computation is not trivial, and their processing time is sometimes far longer than the duration of the output signals.

To speed up the inference, non-AR models have gained a lot of attention thanks to their parallelization-friendly model architectures. Early successful studies of non-AR models are those based on normalizing flows [3, 4, 29], which convert an input noise into speech using stacked invertible deep neural networks (DNNs) [30]. In the last few years, the approach using generative adversarial networks (GANs) [31] has been the most successful non-AR strategy [32–41], where models are trained to generate speech waveforms indistinguishable from human natural speech by discriminator networks. The latest member of the generative models for neural vocoders is the denoising diffusion probabilistic model (DDPM) [42–49], which converts random noise into a speech waveform by the iterative sampling process illustrated in Fig. 1 (a). With hundreds of iterations, DDPMs can generate speech waveforms comparable to those of AR models [42, 43].

Since a DDPM-based neural vocoder iteratively refines the speech waveform, there is a trade-off between its sound quality and computational cost [42]; i.e., tens of iterations are required to achieve a high-fidelity speech waveform. To reduce the number of iterations while maintaining the quality, existing studies of DDPMs have investigated the inference noise schedule [44], the use of an adaptive prior [45, 46], the network architecture [47, 48], and/or the training strategy [49]. However, generating a speech waveform with quality comparable to human natural speech in a few iterations is still challenging.

Recent studies demonstrated that the essence of DDPMs and GANs can coexist [50, 51]. Denoising diffusion GANs [50] use a generator to predict a clean sample from a diffused one, and a discriminator is used to differentiate the diffused samples from the clean or predicted ones. This strategy was applied to TTS, especially to predict a log-mel spectrogram given an input text [51]. As DDPMs and GANs can be combined in several different ways, there may be a new combination that achieves high-quality synthesis with a small number of iterations.
This study proposes WaveFit, an iterative-style non-AR neural vocoder trained using a GAN-based loss, as illustrated in Fig. 1 (c). It is inspired by the theory of fixed-point iteration [52]. The proposed model iteratively applies a DNN as a denoising mapping that removes noise components from an input signal so that the output becomes closer to the target speech. We use a loss that combines a GAN-based [34] and a short-time Fourier transform (STFT)-based [35] loss, as this combination is insensitive to imperceptible phase differences. By combining the loss over all iterations, the intermediate output signals are encouraged to approach the target speech along the iterations. Subjective listening tests showed that WaveFit can generate a speech waveform whose quality is better than that of conventional DDPM models. The experiments also showed that the audio quality of synthetic speech by WaveFit with five iterations is comparable to that of WaveRNN [27] and human natural speech.

2. NON-AUTOREGRESSIVE NEURAL VOCODERS

A neural vocoder generates a speech waveform y_0 ∈ R^D given a log-mel spectrogram c = (c_1, ..., c_K) ∈ R^{FK}, where c_k ∈ R^F is an F-point log-mel spectrum at the k-th time frame and K is the number of time frames. The goal is to develop a neural vocoder so as to generate y_0 indistinguishable from the target speech x_0 ∈ R^D with less computation. This section briefly reviews two types of neural vocoders: DDPM-based and GAN-based ones.

2.1. DDPM-based neural vocoder

A DDPM-based neural vocoder is a latent variable model of x_0 as q(x_0 | c) based on a T-step Markov chain of x_t ∈ R^D with learned Gaussian transitions, starting from q(x_T) = N(0, I), defined as

    q(x_0 \mid c) = \int_{\mathbb{R}^{DT}} q(x_T) \prod_{t=1}^{T} q(x_{t-1} \mid x_t, c)\, dx_1 \cdots dx_T.   (1)

By modeling q(x_{t-1} | x_t, c), y_0 ∼ q(x_0 | c) can be realized as a recursive sampling of y_{t-1} from q(y_{t-1} | y_t, c).

In a DDPM-based neural vocoder, x_t is generated by the diffusion process that gradually adds Gaussian noise to the waveform according to a noise schedule {β_1, ..., β_T} given by p(x_t | x_{t-1}) = N(√(1 − β_t) x_{t-1}, β_t I). This formulation enables us to sample x_t at an arbitrary timestep t in a closed form as x_t = √ᾱ_t x_0 + √(1 − ᾱ_t) ε, where α_t = 1 − β_t, ᾱ_t = ∏_{s=1}^{t} α_s, and ε ∼ N(0, I). As proposed by Ho et al. [53], DDPM-based neural vocoders use a DNN F with parameters θ for predicting ε from x_t as ε̂ = F_θ(x_t, c, β_t). The DNN F can be trained by maximizing the evidence lower bound (ELBO), though most DDPM-based neural vocoders use a simplified loss function which omits the loss weights corresponding to iteration t:

    L_{WG} = \| \epsilon - F_\theta(x_t, c, \beta_t) \|_2^2,   (2)

where ‖·‖_p denotes the ℓ_p norm. Then, if β_t is small enough, q(x_{t-1} | x_t, c) can be given by N(µ_t, γ_t I), and the recursive sampling from q(y_{t-1} | y_t, c) can be realized by iterating the following formula for t = T, ..., 1:

    y_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( y_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} F_\theta(y_t, c, \beta_t) \right) + \gamma_t \epsilon,   (3)

where γ_t = ((1 − ᾱ_{t-1}) / (1 − ᾱ_t)) β_t, y_T ∼ N(0, I), and γ_1 = 0.
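As a concrete illustration of the sampling procedure in Eqs. (1)–(3), the sketch below iterates Eq. (3) with a placeholder denoising DNN; the callable `model` (predicting ε̂ from (y_t, c, β_t)) and the choice of noise schedule are assumptions of this sketch, not the configuration used in the paper.

```python
import numpy as np

def ddpm_sample(model, c, betas, length, rng=None):
    """Iterate Eq. (3): y_{t-1} = (y_t - beta_t/sqrt(1-abar_t) * eps_hat) / sqrt(alpha_t) + gamma_t * eps."""
    rng = np.random.default_rng() if rng is None else rng
    alphas = 1.0 - betas                       # alpha_t = 1 - beta_t
    abars = np.cumprod(alphas)                 # abar_t = prod_{s<=t} alpha_s
    y = rng.standard_normal(length)            # y_T ~ N(0, I)
    for t in range(len(betas), 0, -1):         # t = T, ..., 1
        beta, alpha, abar = betas[t - 1], alphas[t - 1], abars[t - 1]
        eps_hat = model(y, c, beta)            # DNN prediction of the added noise
        y = (y - beta / np.sqrt(1.0 - abar) * eps_hat) / np.sqrt(alpha)
        if t > 1:                              # gamma_1 = 0: no noise added at the final step
            gamma = (1.0 - abars[t - 2]) / (1.0 - abar) * beta
            y = y + gamma * rng.standard_normal(length)
    return y                                   # y_0
```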
The first DDPM-based neural vocoders [42, 43] required over 200 iterations to match AR neural vocoders [26, 27] in naturalness measured by mean opinion score (MOS). To reduce the number of iterations while maintaining the quality, existing studies have investigated the use of noise prior distributions [45, 46] and/or better inference noise schedules [44].

2.1.1. Prior adaptation from conditioning log-mel spectrogram

To reduce the number of iterations in inference, PriorGrad [45] and SpecGrad [46] introduced an adaptive prior N(0, Σ), where Σ is computed from c. The use of an adaptive prior decreases the lower bound of the ELBO, and accelerates both training and inference [45]. SpecGrad [46] uses the fact that Σ is positive semi-definite and can be decomposed as Σ = LLᵀ, where L ∈ R^{D×D} and ᵀ denotes the transpose. Then, sampling from N(0, Σ) can be written as ε = Lε̃ using ε̃ ∼ N(0, I), and Eq. (2) with an adaptive prior becomes

    L_{SG} = \| L^{-1} (\epsilon - F_\theta(x_t, c, \beta_t)) \|_2^2.   (4)

SpecGrad [46] defines L = G⁺MG and approximates L⁻¹ ≈ G⁺M⁻¹G. Here, the NK × D matrix G represents the STFT, M = diag[(m_{1,1}, ..., m_{N,K})] ∈ C^{NK×NK} is the diagonal matrix representing the filter coefficients for each (n, k)-th time-frequency (T-F) bin, and G⁺ is the matrix representation of the inverse STFT (iSTFT) using a dual window. This means L and L⁻¹ are implemented as time-varying filters and their approximate inverse filters in the T-F domain, respectively. The T-F domain filter M is obtained from the spectral envelope calculated from c with a minimum-phase response. The spectral envelope is obtained by applying a 24th-order lifter to the power spectrogram calculated from c.
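The sketch below illustrates how sampling from the adaptive prior, ε = Lε̃ with L ≈ G⁺MG, can be realized by filtering white noise in the T-F domain. The magnitude envelope `env` stands in for the filter M; the minimum-phase construction, the 24th-order lifter, and the dual-window iSTFT of SpecGrad are simplified away, so this is only a rough approximation of the procedure described above.

```python
import numpy as np
from scipy.signal import stft, istft

def sample_adaptive_prior(env, n_fft=512, hop=128, rng=None):
    """Draw eps = L eps_tilde with L ~ G^+ M G (Sec. 2.1.1).
    `env` is a per-bin magnitude filter of shape (n_fft // 2 + 1, num_frames),
    standing in for the spectral-envelope-based filter M."""
    rng = np.random.default_rng() if rng is None else rng
    num_frames = env.shape[1]
    length = hop * (num_frames - 1)
    eps_tilde = rng.standard_normal(length)                              # eps_tilde ~ N(0, I)
    _, _, spec = stft(eps_tilde, nperseg=n_fft, noverlap=n_fft - hop)    # G: STFT
    k = min(spec.shape[1], num_frames)
    spec[:, :k] *= env[:, :k]                                            # M: per-bin filtering
    _, eps = istft(spec, nperseg=n_fft, noverlap=n_fft - hop)            # G^+: inverse STFT
    return eps[:length]
```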
2.1.2. InferGrad

In conventional DDPM models, since the DNNs have been trained as Gaussian denoisers using a simplified loss function such as Eq. (2), there is no guarantee that the generated speech becomes close to the target speech. To solve this problem, InferGrad [49] synthesizes y_0 from a random signal via Eq. (3) in every training step, then additionally minimizes an infer loss L_{IF} which represents the gap between the generated speech y_0 and the target speech x_0. The loss function for InferGrad is given as

    L_{IG} = L_{WG} + \lambda_{IF} L_{IF},   (5)

where λ_{IF} > 0 is a tunable weight for the infer loss.
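A minimal sketch of how the InferGrad objective in Eq. (5) can be evaluated in a training step, reusing `ddpm_sample` from the sketch above; the waveform-domain gap L_IF is passed in as a generic `infer_loss`, since its concrete form and the weight λ_IF are not specified in this excerpt.

```python
import numpy as np

def infergrad_loss(model, x0, c, betas, infer_loss, lambda_if=0.1, rng=None):
    """Eq. (5): L_IG = L_WG + lambda_IF * L_IF, with L_WG from Eq. (2) and
    y_0 synthesized from random noise via Eq. (3) at every training step."""
    rng = np.random.default_rng() if rng is None else rng
    abars = np.cumprod(1.0 - betas)
    t = rng.integers(1, len(betas) + 1)                      # random diffusion step
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(abars[t - 1]) * x0 + np.sqrt(1.0 - abars[t - 1]) * eps
    l_wg = np.sum((eps - model(x_t, c, betas[t - 1])) ** 2)  # Eq. (2)
    y0 = ddpm_sample(model, c, betas, length=x0.shape[0], rng=rng)  # full reverse pass, Eq. (3)
    return l_wg + lambda_if * infer_loss(x0, y0)             # Eq. (5)
```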
2.2. GAN-based neural vocoder

Another popular approach for non-AR neural vocoders is to adopt adversarial training: a neural vocoder is trained to generate a speech waveform that discriminators cannot distinguish from the target speech, and discriminators are trained to differentiate between target and generated speech. In GAN-based models, a non-AR DNN F : R^{FK} → R^D directly outputs y_0 from c as y_0 = F_θ(c).

One main research topic with GAN-based models is the design of loss functions. Recent models often use multiple discriminators at multiple resolutions [34]. One of the pioneering works using multiple discriminators is MelGAN [34], which proposed the multi-scale discriminator (MSD). In addition, MelGAN uses a feature matching loss that minimizes the mean absolute error (MAE) between the discriminator feature maps of target and generated speech. The loss functions of the generator, L^{GAN}_{Gen}, and of the discriminators, L^{GAN}_{Dis}, of a GAN-based neural vocoder are given as follows:

    L^{GAN}_{Gen} = \frac{1}{R_{GAN}} \sum_{r=1}^{R_{GAN}} \left[ -D_r(y_0) + \lambda_{FM} L^{FM}_r(x_0, y_0) \right],   (6)

    L^{GAN}_{Dis} = \frac{1}{R_{GAN}} \sum_{r=1}^{R_{GAN}} \left[ \max(0, 1 - D_r(x_0)) + \max(0, 1 + D_r(y_0)) \right],   (7)

where R_{GAN} is the number of discriminators and λ_{FM} ≥ 0 is a tunable weight for L^{FM}. The r-th discriminator D_r : R^D → R consists of H sub-layers as D_r = D_r^H ∘ ⋯ ∘ D_r^1, where D_r^h : R^{D_{h-1,r}} → R^{D_{h,r}}. Then, the feature matching loss for the r-th discriminator is given by

    L^{FM}_r(x_0, y_0) = \frac{1}{H-1} \sum_{h=1}^{H-1} \frac{1}{D_{h,r}} \| d^h_{x,0} - d^h_{y,0} \|_1,   (8)

where d^h_{a,b} is the output of D_r^h ∘ ⋯ ∘ D_r^1(a_b).
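Eqs. (6)–(8) can be sketched as follows. Each discriminator is assumed to return its intermediate feature maps followed by a scalar score; this interface, and the value of λ_FM, are assumptions of the sketch rather than details given in the paper.

```python
import numpy as np

def gan_losses(discriminators, x0, y0, lambda_fm=10.0):
    """Generator and discriminator losses of Eqs. (6)-(8).
    Each discriminator maps a waveform to [d^1, ..., d^{H-1}, score],
    where d^h are the intermediate sub-layer outputs (feature maps)."""
    l_gen, l_dis = 0.0, 0.0
    for disc in discriminators:
        *feats_x, score_x = disc(x0)
        *feats_y, score_y = disc(y0)
        # Eq. (8): feature matching = mean absolute error per sub-layer, averaged over layers
        l_fm = np.mean([np.mean(np.abs(fx - fy)) for fx, fy in zip(feats_x, feats_y)])
        l_gen += -score_y + lambda_fm * l_fm                         # Eq. (6)
        l_dis += max(0.0, 1.0 - score_x) + max(0.0, 1.0 + score_y)   # Eq. (7), hinge loss
    r = len(discriminators)
    return l_gen / r, l_dis / r
```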
As an auxiliary loss function, a multi-resolution STFT loss is often used to stabilize the adversarial training process [35]. A popular multi-resolution STFT loss L_{MR-STFT} consists of the spectral convergence loss and the magnitude loss [35, 36, 54]:

    L_{MR-STFT}(x_0, y_0) = \frac{1}{R_{STFT}} \sum_{r=1}^{R_{STFT}} \left[ L^{Sc}_r(x_0, y_0) + L^{Mag}_r(x_0, y_0) \right],   (9)

where R_{STFT} is the number of STFT configurations. L^{Sc}_r and L^{Mag}_r correspond to the spectral convergence loss and the magnitude loss of the r-th STFT configuration as L^{Sc}_r(x_0, y_0) = ‖X_{0,r} − Y_{0,r}‖_2 / ‖X_{0,r}‖_2 and L^{Mag}_r(x_0, y_0) = (1/(N_r K_r)) ‖ln(X_{0,r}) − ln(Y_{0,r})‖_1, where N_r and K_r are the numbers of frequency bins and time frames of the r-th STFT configuration, respectively, and X_{0,r} ∈ R^{N_r K_r} and Y_{0,r} ∈ R^{N_r K_r} are the amplitude spectrograms of x_0 and y_0 with the r-th STFT configuration.
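A sketch of the multi-resolution STFT loss in Eq. (9); the three (FFT size, hop) configurations below are illustrative defaults rather than the settings used in [35, 36, 54].

```python
import numpy as np
from scipy.signal import stft

def mr_stft_loss(x0, y0, configs=((512, 128), (1024, 256), (2048, 512)), floor=1e-7):
    """Eq. (9): average of spectral-convergence and log-magnitude losses over STFT configurations."""
    total = 0.0
    for n_fft, hop in configs:
        _, _, X = stft(x0, nperseg=n_fft, noverlap=n_fft - hop)
        _, _, Y = stft(y0, nperseg=n_fft, noverlap=n_fft - hop)
        X, Y = np.abs(X) + floor, np.abs(Y) + floor           # amplitude spectrograms X_{0,r}, Y_{0,r}
        l_sc = np.linalg.norm(X - Y) / np.linalg.norm(X)      # spectral convergence L^Sc_r
        l_mag = np.mean(np.abs(np.log(X) - np.log(Y)))        # magnitude loss L^Mag_r (L1 / (N_r K_r))
        total += l_sc + l_mag
    return total / len(configs)
```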
State-of-the-art GAN-based neural vocoders [33] can achieve a quality nearly on a par with human natural speech. Recent studies showed that the essence of GANs can be incorporated into DDPMs [50, 51]. As DDPMs and GANs can be combined in several different ways, there may be a new combination that achieves high-quality synthesis with a small number of iterations.

3. FIXED-POINT ITERATION

Extensive contributions to data science have been made by fixed-point theory and algorithms [52]. These ideas have recently been combined with DNNs to design data-driven iterative algorithms [55–60]. Our proposed method is also inspired by them, and hence fixed-point theory is briefly reviewed in this section.

A fixed point of a mapping T : R^D → R^D is a point φ that is unchanged by T, i.e., T(φ) = φ. The set of all fixed points of T is denoted as

    \mathrm{Fix}(T) = \{ \varphi \in \mathbb{R}^D \mid T(\varphi) = \varphi \}.   (10)

Let the mapping T be firmly quasi-nonexpansive [61], i.e., it satisfies

    \| T(\xi) - \varphi \|_2 \le \| \xi - \varphi \|_2   (11)

for every ξ ∈ R^D and every φ ∈ Fix(T) (≠ ∅), and there exists a quasi-nonexpansive mapping F that satisfies T = (1/2) Id + (1/2) F, where Id denotes the identity operator. Then, for any initial point, the following fixed-point iteration converges to a fixed point of T:

    \xi_{n+1} = T(\xi_n).   (12)

An example of fixed-point iteration is the following proximal point algorithm, which is a generalization of iterative refinement [63]:

    \xi_{n+1} = \mathrm{prox}_L(\xi_n),   (13)

where prox_L denotes the proximity operator of a loss function L,

    \mathrm{prox}_L(\xi) \in \arg\min_{\zeta} \left[ L(\zeta) + \tfrac{1}{2} \| \xi - \zeta \|_2^2 \right].   (14)

If L is a proper lower-semicontinuous convex function, then prox_L is firmly (quasi-)nonexpansive, and hence a sequence generated by the proximal point algorithm converges to a point in Fix(prox_L) = arg min_ζ L(ζ); i.e., Eq. (13) minimizes the loss function L. Note that Eq. (14) is a negative log-likelihood of maximum a posteriori estimation based on the Gaussian observation model with a prior proportional to exp(−L(·)). That is, Eq. (13) is an iterative Gaussian denoising algorithm like the DDPM-based methods, which motivates us to consider fixed-point theory.

The important property of a firmly quasi-nonexpansive mapping is that it is attracting, i.e., for ξ ∉ Fix(T) equality in Eq. (11) never occurs: ‖T(ξ) − φ‖_2 < ‖ξ − φ‖_2. Hence, applying T always moves an input signal ξ closer to a fixed point φ. In this paper, we consider a denoising mapping as T that removes noise from an input signal, and let us consider an ideal situation. In this case, the fixed-point iteration in Eq. (12) is an iterative denoising algorithm, and the attracting property ensures that each iteration always refines the signal. It converges to a clean signal φ that does not contain any noise, because a fixed point of the denoising mapping, T(φ) = φ, is a signal that cannot be denoised anymore, i.e., no noise remains. If we can construct such a denoising mapping specialized to speech signals, then iterating Eq. (12) from any signal (including random noise) gives a clean speech signal, which realizes a new principle of neural vocoders.
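A toy numerical example of Eqs. (13)–(14): for the quadratic loss L(ζ) = (λ/2)‖ζ − x‖²₂ (chosen only for illustration), the proximity operator has the closed form prox_L(ξ) = (ξ + λx)/(1 + λ), and iterating it from random noise converges to the minimizer x, matching the iterative Gaussian-denoising interpretation above.

```python
import numpy as np

def prox_quadratic(xi, x, lam):
    """prox of L(z) = (lam/2) * ||z - x||_2^2 via Eq. (14):
    argmin_z (lam/2)||z - x||^2 + (1/2)||xi - z||^2 = (xi + lam * x) / (1 + lam)."""
    return (xi + lam * x) / (1.0 + lam)

rng = np.random.default_rng(0)
x = np.sin(np.linspace(0.0, 8.0 * np.pi, 256))   # "clean" signal = the unique fixed point of prox_L
xi = rng.standard_normal(256)                    # start from random noise
for _ in range(50):                              # Eq. (13): xi_{n+1} = prox_L(xi_n)
    xi = prox_quadratic(xi, x, lam=1.0)
print(np.max(np.abs(xi - x)))                    # effectively zero: converged to Fix(prox_L) = {x}
```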
4. PROPOSED METHOD

This section introduces the proposed iterative-style non-AR neural vocoder, WaveFit. Inspired by the success of combining fixed-point theory and deep learning in image processing [60], we adopt a similar idea for speech generation. As mentioned in the last paragraph of Sec. 3, the key idea is to construct a DNN as a denoising mapping satisfying Eq. (11). We propose a loss function which approximately imposes this property in training. Note that the notations from Sec. 2 (e.g., x_0 and y_T) are used in this section.

4.1. Model overview

The proposed model iteratively applies a denoising mapping to refine y_t so that y_{t-1} is closer to x_0. By iterating the following procedure T times, WaveFit generates a speech signal y_0:

    y_{t-1} = G(z_t, c), \qquad z_t = y_t - F_\theta(y_t, c, t).   (15)
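The iteration in Eq. (15) can be sketched as follows. The gain adjustment operator G is not fully specified in this excerpt, so the sketch assumes it rescales the denoised signal z_t to a target power estimated from the conditioning features c (the callables `model` for F_θ and `target_power` are placeholders).

```python
import numpy as np

def wavefit_generate(model, c, T, length, target_power, rng=None, floor=1e-8):
    """WaveFit inference, Eq. (15): z_t = y_t - F_theta(y_t, c, t); y_{t-1} = G(z_t, c).
    Unlike the DDPM sampling of Eq. (3), no random noise is added after the initial y_T."""
    rng = np.random.default_rng() if rng is None else rng
    y = rng.standard_normal(length)          # y_T: initial signal (an adaptive prior could be used instead)
    p_ref = target_power(c)                  # reference power derived from c (assumed form of G)
    for t in range(T, 0, -1):
        z = y - model(y, c, t)               # denoising step: subtract the estimated noise component
        gain = np.sqrt(p_ref / (np.mean(z ** 2) + floor))
        y = gain * z                         # G: gain adjustment
    return y                                 # y_0
```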
Table 2. Side-by-side test results with their 95% confidence intervals. A positive score indicates that Method-A was preferred.

Method-A    Method-B       SxS               p-value
WaveFit-3   InferGrad-3     0.375 ± 0.073    0.0000
WaveFit-3   WaveRNN        −0.051 ± 0.044    0.0027
WaveFit-5   InferGrad-5     0.063 ± 0.050    0.0012
WaveFit-5   WaveRNN        −0.018 ± 0.044    0.2924
WaveFit-5   Ground-truth   −0.027 ± 0.037    0.0568

Adding random noise at each iteration is an essential part of DDPM, which is required to make q(x_{t-1} | x_t, c) Gaussian [53, 70]. In contrast, WaveFit denoises an intermediate signal without adding random noise. Therefore, the noise reduction level at each iteration can be small. This characteristic allows WaveFit to achieve higher audio quality with fewer iterations. We provide intermediate output examples of these models on our demo page.²

5.4. Comparison with WaveRNN and DDPM-based models

The MOS and RTF results and the SxS results using the 1,000 evaluation samples are shown in Tables 1 and 2, respectively. In all three iteration models, WaveFit produced better quality than both SpecGrad [46] and InferGrad [49]. As InferGrad used the same network architecture and adversarial loss as WaveFit, the main differences between them are (i) whether the loss value is computed for all intermediate outputs, and (ii) whether random noise is added at each iteration. These MOS and SxS results indicate that fixed-point iteration is a better strategy than DDPM for iterative-style neural vocoders.

InferGrad-3 was significantly better than SpecGrad-3. The only difference between InferGrad-3 and SpecGrad-3 is the use of λ_{IF}. This result supports the hypothesis in Sec. 4.3 that the mapping in DDPM-based neural vocoders only focuses on removing random components in input signals rather than moving the input signals towards the target. Therefore, incorporating the difference between generated and target speech into the loss function of iterative-style neural vocoders is a promising approach.

On the RTF comparison, WaveFit models were slightly faster than SpecGrad and InferGrad models with the same number of iterations. This is because DDPM-based models need to sample a noise waveform at each iteration, whereas WaveFit requires it only at the first iteration.

Although WaveFit-3 was worse than WaveRNN, WaveFit-5 achieved naturalness comparable to WaveRNN and human natural speech; there were no significant differences in the SxS tests with α = 0.01. We would like to highlight that (i) the inference

5.5. Comparison with GAN-based models

The MOS and SxS results using the LibriTTS 1,000 evaluation samples are shown in Table 3. These results show that WaveFit-5 was significantly better than MB-MelGAN, and there was no significant difference in naturalness between WaveFit-5 and HiFi-GAN V1. In terms of model complexity, the RTF and model size of WaveFit-5 are 0.07 and 13.8M, respectively, which are comparable to those of HiFi-GAN V1 reported in the original paper [33], 0.065 and 13.92M, respectively. These results indicate that WaveFit-5 is comparable in model complexity and naturalness with the well-tuned HiFi-GAN V1 model on the LibriTTS dataset.

We found that some outputs from WaveFit-5 were contaminated by pulsive artifacts. When we trained WaveFit using a clean dataset recorded in an anechoic chamber (the dataset used in the experiments of Sec. 5.4), such artifacts were not observed. In contrast, the target waveforms used in this experiment were not totally clean but contained some noise, which resulted in the erroneous output samples. This result indicates that WaveFit models might not be robust against noise and reverberation in the training dataset. We used the SpecGrad architecture from [46] both for WaveFit and the DDPM-based models because we considered the DDPM-based models to be the direct competitors of WaveFit and using the same architecture to provide a fair comparison. After we realized the superiority of WaveFit over the other DDPM-based models, we performed a comparison with GAN-based models, and hence the model architecture of WaveFit in the current version is not as sophisticated as that of GAN-based models. Indeed, WaveFit-1 is significantly worse than GAN-based models, which can be heard from our audio demo.² There is a lot of room for improvement in the performance and robustness of WaveFit by seeking a proper architecture for F, which is left as future work.

6. CONCLUSION

This paper proposed WaveFit, which integrates the essence of GANs into a DDPM-like iterative framework based on fixed-point iteration. WaveFit iteratively denoises an input signal like DDPMs while not adding random noise at each iteration. This strategy was realized by training a DNN using a loss inspired by the concept of fixed-point theory. The subjective listening experiments showed that WaveFit can generate a speech waveform whose quality is better than that of conventional DDPM models. We also showed that the quality achieved by WaveFit with five iterations was comparable to WaveRNN and human natural speech, while its inference speed was more than 240 times faster than WaveRNN.
7. REFERENCES

[1] S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, and J. Sotelo, “SampleRNN: An unconditional end-to-end neural audio generation model,” in Proc. ICLR, 2018.
[2] A. Tamamori, T. Hayashi, K. Kobayashi, K. Takeda, and T. Toda, “Speaker-dependent WaveNet vocoder,” in Proc. Interspeech, 2017.
[3] R. Prenger, R. Valle, and B. Catanzaro, “WaveGlow: A flow-based generative network for speech synthesis,” in Proc. ICASSP, 2019.
[4] W. Ping, K. Peng, K. Zhao, and Z. Song, “WaveFlow: A compact flow-based model for raw audio,” in Proc. ICML, 2020.
[5] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgiannakis, and Y. Wu, “Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions,” in Proc. ICASSP, 2018.
[6] J. Shen, Y. Jia, M. Chrzanowski, Y. Zhang, I. Elias, H. Zen, and Y. Wu, “Non-attentive Tacotron: Robust and controllable neural TTS synthesis including unsupervised duration modeling,” arXiv:2010.04301, 2020.
[7] I. Elias, H. Zen, J. Shen, Y. Zhang, Y. Jia, R. J. Weiss, and Y. Wu, “Parallel Tacotron: Non-autoregressive and controllable TTS,” in Proc. ICASSP, 2021.
[8] Y. Jia, H. Zen, J. Shen, Y. Zhang, and Y. Wu, “PnG BERT: Augmented BERT on phonemes and graphemes for neural TTS,” in Proc. Interspeech, 2021.
[9] Y. Ren, Y. Ruan, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, “FastSpeech: Fast, robust and controllable text to speech,” in Proc. NeurIPS, 2019.
[10] Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, “FastSpeech 2: Fast and high-quality end-to-end text to speech,” in Proc. ICLR, 2021.
[11] B. Sisman, J. Yamagishi, S. King, and H. Li, “An overview of voice conversion and its challenges: From statistical modeling to deep learning,” IEEE/ACM Trans. Audio, Speech, Lang. Process., 2021.
[12] W.-C. Huang, S.-W. Yang, T. Hayashi, and T. Toda, “A comparative study of self-supervised speech representation based voice conversion,” IEEE J. Sel. Top. Signal Process., 2022.
[13] Y. Jia, R. J. Weiss, F. Biadsy, W. Macherey, M. Johnson, Z. Chen, and Y. Wu, “Direct speech-to-speech translation with a sequence-to-sequence model,” in Proc. Interspeech, 2019.
[14] Y. Jia, M. T. Ramanovich, T. Remez, and R. Pomerantz, “Translatotron 2: High-quality direct speech-to-speech translation with voice preservation,” in Proc. ICML, 2022.
[15] A. Lee, P.-J. Chen, C. Wang, J. Gu, S. Popuri, X. Ma, A. Polyak, Y. Adi, Q. He, Y. Tang, J. Pino, and W.-N. Hsu, “Direct speech-to-speech translation with discrete units,” in Proc. 60th Annu. Meet. Assoc. Comput. Linguist. (Vol. 1: Long Pap.), 2022.
[16] S. Maiti and M. I. Mandel, “Parametric resynthesis with neural vocoders,” in Proc. IEEE WASPAA, 2019.
[17] ——, “Speaker independence of neural vocoders and their effect on parametric resynthesis speech enhancement,” in Proc. ICASSP, 2020.
[18] J. Su, Z. Jin, and A. Finkelstein, “HiFi-GAN: High-fidelity denoising and dereverberation based on speech deep features in adversarial networks,” in Proc. Interspeech, 2020.
[19] ——, “HiFi-GAN-2: Studio-quality speech enhancement via generative adversarial networks conditioned on acoustic features,” in Proc. IEEE WASPAA, 2021.
[20] H. Liu, Q. Kong, Q. Tian, Y. Zhao, D. L. Wang, C. Huang, and Y. Wang, “VoiceFixer: Toward general speech restoration with neural vocoder,” arXiv:2109.13731, 2021.
[21] T. Saeki, S. Takamichi, T. Nakamura, N. Tanji, and H. Saruwatari, “SelfRemaster: Self-supervised speech restoration with analysis-by-synthesis approach using channel modeling,” in Proc. Interspeech, 2022.
[22] W. B. Kleijn, F. S. C. Lim, A. Luebs, J. Skoglund, F. Stimberg, Q. Wang, and T. C. Walters, “WaveNet based low rate speech coding,” in Proc. ICASSP, 2018.
[23] T. Yoshimura, K. Hashimoto, K. Oura, Y. Nankaku, and K. Tokuda, “WaveNet-based zero-delay lossless speech coding,” in Proc. SLT, 2018.
[24] J.-M. Valin and J. Skoglund, “A real-time wideband neural vocoder at 1.6kb/s using LPCNet,” in Proc. Interspeech, 2019.
[25] N. Zeghidour, A. Luebs, A. Omran, J. Skoglund, and M. Tagliasacchi, “SoundStream: An end-to-end neural audio codec,” IEEE/ACM Trans. Audio, Speech, Lang. Process., 2022.
[26] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “WaveNet: A generative model for raw audio,” arXiv:1609.03499, 2016.
[27] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. van den Oord, S. Dieleman, and K. Kavukcuoglu, “Efficient neural audio synthesis,” in Proc. ICML, 2018.
[28] J.-M. Valin and J. Skoglund, “LPCNet: Improving neural speech synthesis through linear prediction,” in Proc. ICASSP, 2019.
[29] A. van den Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. van den Driessche, E. Lockhart, L. C. Cobo, F. Stimberg, N. Casagrande, D. Grewe, S. Noury, S. Dieleman, E. Elsen, N. Kalchbrenner, H. Zen, A. Graves, H. King, T. Walters, D. Belov, and D. Hassabis, “Parallel WaveNet: Fast high-fidelity speech synthesis,” in Proc. ICML, 2018.
[30] D. J. Rezende and S. Mohamed, “Variational inference with normalizing flows,” in Proc. ICML, 2015.
[31] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Proc. NeurIPS, 2014.
[32] C. Donahue, J. McAuley, and M. Puckette, “Adversarial audio synthesis,” in Proc. ICLR, 2019.
[33] J. Kong, J. Kim, and J. Bae, “HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis,” in Proc. NeurIPS, 2020.
[34] K. Kumar, R. Kumar, T. de Boissiere, L. Gestin, W. Z. Teoh, J. Sotelo, A. de Brébisson, Y. Bengio, and A. C. Courville, “MelGAN: Generative adversarial networks for conditional waveform synthesis,” in Proc. NeurIPS, 2019.
[35] R. Yamamoto, E. Song, and J.-M. Kim, “Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram,” in Proc. ICASSP, 2020.
[36] G. Yang, S. Yang, K. Liu, P. Fang, W. Chen, and L. Xie, “Multi-band MelGAN: Faster waveform generation for high-quality text-to-speech,” in Proc. IEEE SLT, 2021.
[37] J. You, D. Kim, G. Nam, G. Hwang, and G. Chae, “GAN vocoder: Multi-resolution discriminator is all you need,” arXiv:2103.05236, 2021.
[38] W. Jang, D. Lim, J. Yoon, B. Kim, and J. Kim, “UnivNet: A neural vocoder with multi-resolution spectrogram discriminators for high-fidelity waveform generation,” in Proc. Interspeech, 2021.
[39] T. Kaneko, K. Tanaka, H. Kameoka, and S. Seki, “iSTFTNet: Fast and lightweight mel-spectrogram vocoder incorporating inverse short-time Fourier transform,” in Proc. ICASSP, 2022.
[40] T. Bak, J. Lee, H. Bae, J. Yang, J.-S. Bae, and Y.-S. Joo, “Avocodo: Generative adversarial network for artifact-free vocoder,” arXiv:2206.13404, 2022.
[41] S.-g. Lee, W. Ping, B. Ginsburg, B. Catanzaro, and S. Yoon, “BigVGAN: A universal neural vocoder with large-scale training,” arXiv:2206.04658, 2022.
[42] N. Chen, Y. Zhang, H. Zen, R. J. Weiss, M. Norouzi, and W. Chan, “WaveGrad: Estimating gradients for waveform generation,” in Proc. ICLR, 2021.
[43] Z. Kong, W. Ping, J. Huang, K. Zhao, and B. Catanzaro, “DiffWave: A versatile diffusion model for audio synthesis,” in Proc. ICLR, 2021.
[44] M. W. Y. Lam, J. Wang, D. Su, and D. Yu, “BDDM: Bilateral denoising diffusion models for fast and high-quality speech synthesis,” in Proc. ICLR, 2022.
[45] S. Lee, H. Kim, C. Shin, X. Tan, C. Liu, Q. Meng, T. Qin, W. Chen, S. Yoon, and T.-Y. Liu, “PriorGrad: Improving conditional denoising diffusion models with data-dependent adaptive prior,” in Proc. ICLR, 2022.
[46] Y. Koizumi, H. Zen, K. Yatabe, N. Chen, and M. Bacchiani, “SpecGrad: Diffusion probabilistic model based neural vocoder with adaptive noise spectral shaping,” in Proc. Interspeech, 2022.
[47] T. Okamoto, T. Toda, Y. Shiga, and H. Kawai, “Noise level limited sub-modeling for diffusion probabilistic vocoders,” in Proc. ICASSP, 2021.
[48] K. Goel, A. Gu, C. Donahue, and C. Ré, “It’s Raw! Audio generation with state-space models,” arXiv:2202.09729, 2022.
[49] Z. Chen, X. Tan, K. Wang, S. Pan, D. Mandic, L. He, and S. Zhao, “InferGrad: Improving diffusion models for vocoder by considering inference in training,” in Proc. ICASSP, 2022.
[50] Z. Xiao, K. Kreis, and A. Vahdat, “Tackling the generative learning trilemma with denoising diffusion GANs,” in Proc. ICLR, 2022.
[51] S. Liu, D. Su, and D. Yu, “DiffGAN-TTS: High-fidelity and efficient text-to-speech with denoising diffusion GANs,” arXiv:2201.11972, 2022.
[52] P. L. Combettes and J.-C. Pesquet, “Fixed point strategies in data science,” IEEE Trans. Signal Process., 2021.
[53] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” in Proc. NeurIPS, 2020.
[54] A. Defossez, G. Synnaeve, and Y. Adi, “Real time speech enhancement in the waveform domain,” in Proc. Interspeech, 2020.
[55] G. T. Buzzard, S. H. Chan, S. Sreehari, and C. A. Bouman, “Plug-and-play unplugged: Optimization-free reconstruction using consensus equilibrium,” SIAM J. Imaging Sci., 2018.
[56] E. Ryu, J. Liu, S. Wang, X. Chen, Z. Wang, and W. Yin, “Plug-and-play methods provably converge with properly trained denoisers,” in Proc. ICML, 2019.
[57] J.-C. Pesquet, A. Repetti, M. Terris, and Y. Wiaux, “Learning maximally monotone operators for image recovery,” SIAM J. Imaging Sci., 2021.
[58] Y. Masuyama, K. Yatabe, Y. Koizumi, Y. Oikawa, and N. Harada, “Deep Griffin–Lim iteration,” in Proc. ICASSP, 2019.
[59] ——, “Deep Griffin–Lim iteration: Trainable iterative phase reconstruction using neural network,” IEEE J. Sel. Top. Signal Process., 2021.
[60] R. Cohen, M. Elad, and P. Milanfar, “Regularization by denoising via fixed-point projection (RED-PRO),” SIAM J. Imaging Sci., 2021.
[61] H. H. Bauschke and P. L. Combettes, Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, 2017.
[62] I. Yamada, M. Yukawa, and M. Yamagishi, Minimizing the Moreau envelope of nonsmooth convex functions over the fixed point set of certain quasi-nonexpansive mappings. Springer, 2011, pp. 345–390.
[63] N. Parikh and S. Boyd, “Proximal algorithms,” Found. Trends Optim., 2014.
[64] M. Tagliasacchi, Y. Li, K. Misiunas, and D. Roblek, “SEANet: A multi-modal speech enhancement network,” in Proc. Interspeech, 2020.
[65] S. Theodoridis, K. Slavakis, and I. Yamada, “Adaptive learning in a world of projections,” IEEE Signal Process. Mag., 2011.
[66] T. Hayashi, “Parallel WaveGAN implementation with PyTorch,” github.com/kan-bayashi/ParallelWaveGAN.
[67] H. Zen, R. Clark, R. J. Weiss, V. Dang, Y. Jia, Y. Wu, Y. Zhang, and Z. Chen, “LibriTTS: A corpus derived from LibriSpeech for text-to-speech,” in Proc. Interspeech, 2019.
[68] D. P. Kingma and J. L. Ba, “Adam: A method for stochastic optimization,” in Proc. ICLR, 2015.
[69] A. A. Gritsenko, T. Salimans, R. van den Berg, J. Snoek, and N. Kalchbrenner, “A spectral energy distance for parallel speech synthesis,” in Proc. NeurIPS, 2020.
[70] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, “Deep unsupervised learning using nonequilibrium thermodynamics,” in Proc. ICML, 2015.