0% found this document useful (0 votes)
17 views8 pages

Wavefit: An Iterative and Non-Autoregressive Neural Vocoder Based On Fixed-Point Iteration

WAVEFIT: AN ITERATIVE AND NON-AUTOREGRESSIVE NEURAL VOCODER BASED ON FIXED-POINT ITERATION

Uploaded by

kongdw8
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
17 views8 pages

Wavefit: An Iterative and Non-Autoregressive Neural Vocoder Based On Fixed-Point Iteration

WAVEFIT: AN ITERATIVE AND NON-AUTOREGRESSIVE NEURAL VOCODER BASED ON FIXED-POINT ITERATION

Uploaded by

kongdw8
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 8

WAVEFIT: AN ITERATIVE AND NON-AUTOREGRESSIVE NEURAL VOCODER

BASED ON FIXED-POINT ITERATION

Yuma Koizumi1 , Kohei Yatabe2 , Heiga Zen1 , Michiel Bacchiani1


1 2
Google Research, Japan Tokyo University of Agriculture and Technology, Japan

(a) DDPM-based models (b) GAN-based models (c) WaveFit


ABSTRACT
Log-mel spectrogram c Log-mel spectrogram c Log-mel spectrogram c
<latexit sha1_base64="Zv18Jv5MgC4Mxrz/CC1v3efY/dE=">AAACkXichVE9S8NQFD2N3/WjVRfBpVgUp3Ir4tdUdBFcbGtroYok8amx+SJJC7X4B5zcRJ0UHMSf4OjiH3DoTxDHCi4O3qQBUVFvSN55551zXw5XsXXN9YiaEamjs6u7p7cv2j8wOBSLD48UXavqqKKgWrrllBTZFbpmioKneboo2Y6QDUUXm0plxT/frAnH1Sxzw6vbYtuQ901tT1Nlj6nilmI01OOdeJJSFFTiJ0iHIImw1q34PbawCwsqqjAgYMJjrEOGy08ZaRBs5rbRYM5hpAXnAseIsrfKKsEKmdkKf/d5Vw5Zk/d+Tzdwq3yLzq/DzgQm6YluqUWPdEfP9P5rr0bQw/+XOq9K2yvsndjJWP7tX5fBq4eDT9cfDoXV/+v8bB72sBBk0jijHTB+WrXtrx2dtfJLucnGFF3TC+e8oiY9cFKz9qreZEXuElEeVPr7WH6C4kwqPZei7GwysxyOrBfjmMA0z2UeGaxiHQW+9xCnOMeFNCotShkp1EqR0DOKLyWtfQCLJJPo</latexit>

<latexit sha1_base64="Zv18Jv5MgC4Mxrz/CC1v3efY/dE=">AAACkXichVE9S8NQFD2N3/WjVRfBpVgUp3Ir4tdUdBFcbGtroYok8amx+SJJC7X4B5zcRJ0UHMSf4OjiH3DoTxDHCi4O3qQBUVFvSN55551zXw5XsXXN9YiaEamjs6u7p7cv2j8wOBSLD48UXavqqKKgWrrllBTZFbpmioKneboo2Y6QDUUXm0plxT/frAnH1Sxzw6vbYtuQ901tT1Nlj6nilmI01OOdeJJSFFTiJ0iHIImw1q34PbawCwsqqjAgYMJjrEOGy08ZaRBs5rbRYM5hpAXnAseIsrfKKsEKmdkKf/d5Vw5Zk/d+Tzdwq3yLzq/DzgQm6YluqUWPdEfP9P5rr0bQw/+XOq9K2yvsndjJWP7tX5fBq4eDT9cfDoXV/+v8bB72sBBk0jijHTB+WrXtrx2dtfJLucnGFF3TC+e8oiY9cFKz9qreZEXuElEeVPr7WH6C4kwqPZei7GwysxyOrBfjmMA0z2UeGaxiHQW+9xCnOMeFNCotShkp1EqR0DOKLyWtfQCLJJPo</latexit> <latexit sha1_base64="Zv18Jv5MgC4Mxrz/CC1v3efY/dE=">AAACkXichVE9S8NQFD2N3/WjVRfBpVgUp3Ir4tdUdBFcbGtroYok8amx+SJJC7X4B5zcRJ0UHMSf4OjiH3DoTxDHCi4O3qQBUVFvSN55551zXw5XsXXN9YiaEamjs6u7p7cv2j8wOBSLD48UXavqqKKgWrrllBTZFbpmioKneboo2Y6QDUUXm0plxT/frAnH1Sxzw6vbYtuQ901tT1Nlj6nilmI01OOdeJJSFFTiJ0iHIImw1q34PbawCwsqqjAgYMJjrEOGy08ZaRBs5rbRYM5hpAXnAseIsrfKKsEKmdkKf/d5Vw5Zk/d+Tzdwq3yLzq/DzgQm6YluqUWPdEfP9P5rr0bQw/+XOq9K2yvsndjJWP7tX5fBq4eDT9cfDoXV/+v8bB72sBBk0jijHTB+WrXtrx2dtfJLucnGFF3TC+e8oiY9cFKz9qreZEXuElEeVPr7WH6C4kwqPZei7GwysxyOrBfjmMA0z2UeGaxiHQW+9xCnOMeFNCotShkp1EqR0DOKLyWtfQCLJJPo</latexit>


arXiv:2210.01029v1 [eess.AS] 3 Oct 2022

Denoising diffusion probabilistic models (DDPMs) and generative


adversarial networks (GANs) are popular generative models for neu-
ral vocoders. The DDPMs and GANs can be characterized by the
iterative denoising framework and adversarial training, respectively.
Noise generation Noise generation
This study proposes a fast and high-quality neural vocoder called yT
<latexit sha1_base64="IoBXzpB/i/H3onkl9rKzdMt4ZYs=">AAACk3ichVE9S8NQFD3Gr1q/qiIILmJRnMqtiIoupTq4CK21WmilJPFVQ/NFkhZq8Q84ujjoouAg/gRHF/+Agz9BHCu4OHiTBkRFvSHv3XfeOTc5HMXWNdcjeuqQOru6e3ojfdH+gcGh4djI6I5r1RxV5FVLt5yCIrtC10yR9zRPFwXbEbKh6GJXqa7597t14biaZW57DVvsGfKBqVU0VfYYKpQUo9k4Lm+XY3FKUFBTP5tk2MQRVsaK3aGEfVhQUYMBARMe9zpkuPwUkQTBZmwPTcYc7rTgXuAYUdbWmCWYITNa5fWAT8UQNfnsz3QDtcpf0fl1WDmFGXqkG2rRA93SM73/OqsZzPD/pcG70tYKuzx8MpF7+1dl8O7h8FP1h0Jh9v8835uHCpYDTxp7tAPEd6u29fWjs1ZuZWumOUtX9MI+L+mJ7tmpWX9Vr7Ni6xxRDir5PZafzc58IrmYoOxCPJUOI4tgEtOY41yWkMIGMsgHeZziHBfSuLQqpaX1NlXqCDVj+FLS5geNfZTF</latexit>

x0
<latexit sha1_base64="vkOcqP3l5iv4PNLUqQWZXpBdbVs=">AAAClXichVHLSsNQFJzGV62vqgsFN2JRXJVTERVBKCriTqu2ig9KEq8azIskLWroD7gWXIiCggvxE1y68Qdc+AnisoIbF56kAVFRT0ju3Lkz52Y4iq1rrkf0FJPq6hsam+LNiZbWtvaOZGdXwbVKjiryqqVbzpoiu0LXTJH3NE8Xa7YjZEPRxaqyPxOcr5aF42qWueId2mLLkHdNbUdTZY+p9U3F8A8qRZ8qxWSK0hRW/0+QiUAKUS1ayTtsYhsWVJRgQMCEx1iHDJefDWRAsJnbgs+cw0gLzwUqSLC3xCrBCpnZff7u8m4jYk3eBz3d0K3yLTq/Djv7MUiPdENVeqBbeqb3X3v5YY/gXw55VWpeYRc7jnuX3/51Gbx62Pt0/eFQWP2/LsjmYQcTYSaNM9ohE6RVa/7y0Wl1eXJp0B+iK3rhnJf0RPec1Cy/qtc5sXSGBA8q830sP0FhJJ0ZS1NuNJWdjkYWRx8GMMxzGUcW81hEnu81cYJzXEg90pQ0K83VpFIs8nTjS0kLH6bYlaw=</latexit>

yT
<latexit sha1_base64="IoBXzpB/i/H3onkl9rKzdMt4ZYs=">AAACk3ichVE9S8NQFD3Gr1q/qiIILmJRnMqtiIoupTq4CK21WmilJPFVQ/NFkhZq8Q84ujjoouAg/gRHF/+Agz9BHCu4OHiTBkRFvSHv3XfeOTc5HMXWNdcjeuqQOru6e3ojfdH+gcGh4djI6I5r1RxV5FVLt5yCIrtC10yR9zRPFwXbEbKh6GJXqa7597t14biaZW57DVvsGfKBqVU0VfYYKpQUo9k4Lm+XY3FKUFBTP5tk2MQRVsaK3aGEfVhQUYMBARMe9zpkuPwUkQTBZmwPTcYc7rTgXuAYUdbWmCWYITNa5fWAT8UQNfnsz3QDtcpf0fl1WDmFGXqkG2rRA93SM73/OqsZzPD/pcG70tYKuzx8MpF7+1dl8O7h8FP1h0Jh9v8835uHCpYDTxp7tAPEd6u29fWjs1ZuZWumOUtX9MI+L+mJ7tmpWX9Vr7Ni6xxRDir5PZafzc58IrmYoOxCPJUOI4tgEtOY41yWkMIGMsgHeZziHBfSuLQqpaX1NlXqCDVj+FLS5geNfZTF</latexit>

x0
<latexit sha1_base64="vkOcqP3l5iv4PNLUqQWZXpBdbVs=">AAAClXichVHLSsNQFJzGV62vqgsFN2JRXJVTERVBKCriTqu2ig9KEq8azIskLWroD7gWXIiCggvxE1y68Qdc+AnisoIbF56kAVFRT0ju3Lkz52Y4iq1rrkf0FJPq6hsam+LNiZbWtvaOZGdXwbVKjiryqqVbzpoiu0LXTJH3NE8Xa7YjZEPRxaqyPxOcr5aF42qWueId2mLLkHdNbUdTZY+p9U3F8A8qRZ8qxWSK0hRW/0+QiUAKUS1ayTtsYhsWVJRgQMCEx1iHDJefDWRAsJnbgs+cw0gLzwUqSLC3xCrBCpnZff7u8m4jYk3eBz3d0K3yLTq/Djv7MUiPdENVeqBbeqb3X3v5YY/gXw55VWpeYRc7jnuX3/51Gbx62Pt0/eFQWP2/LsjmYQcTYSaNM9ohE6RVa/7y0Wl1eXJp0B+iK3rhnJf0RPec1Cy/qtc5sXSGBA8q830sP0FhJJ0ZS1NuNJWdjkYWRx8GMMxzGUcW81hEnu81cYJzXEg90pQ0K83VpFIs8nTjS0kLH6bYlaw=</latexit>

WaveFit, which integrates the essence of GANs into a DDPM-like G(Id F)


<latexit sha1_base64="OLYUr7p9HaNR3EPcARLNgI3OThY=">AAAC6XichVHLThRRED00PnB8MODGxMRMnGBg4aTGGCGsCBCVhZGHAyQMmXQ3F+hMv9J9ZxLszI4VO1YaXalxoS71D9z4Ay7wD4xLTNy48PQjGiRKdbrr1Ll1qqtuWaHrxFrkoM/oP3X6zNmBc6XzFy5eGiwPDS/HQSeyVcMO3CBatcxYuY6vGtrRrloNI2V6lqtWrPZMer7SVVHsBP4jvROqdc/c8p1NxzY1qVb5WtMz9bZtusm93mhzbqNys/Kbudsba5WrUpPMKsdBvQBVFDYflL+giQ0EsNGBBwUfmtiFiZjPGuoQhOTWkZCLiJzsXKGHErUdZilmmGTb/G4xWitYn3FaM87UNv/i8o2orGBEPssbOZRP8k6+ys9/1kqyGmkvO/RWrlVha3DvytKPE1Uevcb2H9V/FBazT85LZ9PYxEQ2k8MZw4xJp7Vzfffxk8OlycWR5Ia8lG+c84UcyEdO6ne/268X1OJzVk/rz1KV33VE9KDo9SErKjJplN5Wwsw57qhHlPsS11z/e6nHwfKtWv1OTRZuV6emi4UP4CquY5RbHccU7mMeDXawi7d4jw9G29g3nhrP8lSjr9BcxhEzXv0CdaKqJA==</latexit>

| yt , c)
<latexit sha1_base64="Kb+9m5lRalDo8sTB/SjNIFpjXOw=">AAAC8HichVHLahRRED1pX3F8ZNSN4GZwGImgQ7WIiqugLnQh5uEkgUwYujs3ySX9ovvOwNj0D+gHKLiQCC7UlVtduvEHXMQ/EJcR3Ljw9ANEg6aa7jp1bp3qqltu7OvUiOxMWAcOHjp8ZPJo49jxEyenmqdOL6bRMPFUz4v8KFl2nVT5OlQ9o42vluNEOYHrqyV363ZxvjRSSaqj8KEZx2o1cDZCva49x5AaNDvxdN8NsnE+yMxlO2/1A73WqhlzqURefnHQbEtXSmvtBXYN2qhtNmp+QR9riOBhiAAKIQyxDwcpnxXYEMTkVpGRS4h0ea6Qo0HtkFmKGQ7ZLX43GK3UbMi4qJmWao9/8fkmVLbQkc/yWnblk7yVr/Lzn7WyskbRy5jerbQqHkw9PrvwY19VQG+w+Vv1H4XL7P3zitkM1nGjnElzxrhkimm9Sj969HR34eZ8J7sgL+Ub59yWHfnIScPRd+/VnJp/zupF/TtUVXedEN2ve33AiopMERW3lTHzHneUE1W+wTXbfy91L1i80rWvdWXuanvmVr3wSZzDeUxzq9cxg7uYRY8dPME7vMcHK7GeWS+s7SrVmqg1Z/CHWW9+Ad7OrVc=</latexit>

p(yt
GAN GAN
1

iterative framework based on fixed-point iteration. WaveFit itera- yt


<latexit sha1_base64="OspnX+yHSbX7qkurs5eErJTXzw8=">AAACk3ichVE9S8NQFD3G7/pVFUFwEUvFqdyKqOhS1MFF0Go/oEpJ4qsG80XyWqjFP+Do4qCLgoP4Exxd/AMO/gRxVHBx8CYNiIp6Q96777xzbnI4mmsaviR6bFFa29o7Oru6Yz29ff0D8cGhvO9UPV3kdMd0vKKm+sI0bJGThjRF0fWEammmKGgHy8F9oSY833DsLVl3xY6l7tlGxdBVyVBxW7Ma9aOyLMcTlKKwxn826ahJIKp1J36LbezCgY4qLAjYkNybUOHzU0IaBJexHTQY87gzwnuBI8RYW2WWYIbK6AGve3wqRajN52CmH6p1/orJr8fKcSTpga7phe7php7o/ddZjXBG8C913rWmVrjlgePRzbd/VRbvEvufqj8UGrP/5wXeJCqYDz0Z7NENkcCt3tTXDk9fNheyycYkXdIz+7ygR7pjp3btVb/aENkzxDio9PdYfjb56VR6NkUbM4nMUhRZF8YwgSnOZQ4ZrGIduTCPE5zhXBlRFpUlZaVJVVoizTC+lLL2AdJ9lOU=</latexit>

MMSE
Loss
yt
<latexit sha1_base64="OspnX+yHSbX7qkurs5eErJTXzw8=">AAACk3ichVE9S8NQFD3G7/pVFUFwEUvFqdyKqOhS1MFF0Go/oEpJ4qsG80XyWqjFP+Do4qCLgoP4Exxd/AMO/gRxVHBx8CYNiIp6Q96777xzbnI4mmsaviR6bFFa29o7Oru6Yz29ff0D8cGhvO9UPV3kdMd0vKKm+sI0bJGThjRF0fWEammmKGgHy8F9oSY833DsLVl3xY6l7tlGxdBVyVBxW7Ma9aOyLMcTlKKwxn826ahJIKp1J36LbezCgY4qLAjYkNybUOHzU0IaBJexHTQY87gzwnuBI8RYW2WWYIbK6AGve3wqRajN52CmH6p1/orJr8fKcSTpga7phe7php7o/ddZjXBG8C913rWmVrjlgePRzbd/VRbvEvufqj8UGrP/5wXeJCqYDz0Z7NENkcCt3tTXDk9fNheyycYkXdIz+7ygR7pjp3btVb/aENkzxDio9PdYfjb56VR6NkUbM4nMUhRZF8YwgSnOZQ4ZrGIduTCPE5zhXBlRFpUlZaVJVVoizTC+lLL2AdJ9lOU=</latexit>

Loss
G(Id F)
<latexit sha1_base64="OLYUr7p9HaNR3EPcARLNgI3OThY=">AAAC6XichVHLThRRED00PnB8MODGxMRMnGBg4aTGGCGsCBCVhZGHAyQMmXQ3F+hMv9J9ZxLszI4VO1YaXalxoS71D9z4Ay7wD4xLTNy48PQjGiRKdbrr1Ll1qqtuWaHrxFrkoM/oP3X6zNmBc6XzFy5eGiwPDS/HQSeyVcMO3CBatcxYuY6vGtrRrloNI2V6lqtWrPZMer7SVVHsBP4jvROqdc/c8p1NxzY1qVb5WtMz9bZtusm93mhzbqNys/Kbudsba5WrUpPMKsdBvQBVFDYflL+giQ0EsNGBBwUfmtiFiZjPGuoQhOTWkZCLiJzsXKGHErUdZilmmGTb/G4xWitYn3FaM87UNv/i8o2orGBEPssbOZRP8k6+ys9/1kqyGmkvO/RWrlVha3DvytKPE1Uevcb2H9V/FBazT85LZ9PYxEQ2k8MZw4xJp7Vzfffxk8OlycWR5Ia8lG+c84UcyEdO6ne/268X1OJzVk/rz1KV33VE9KDo9SErKjJplN5Wwsw57qhHlPsS11z/e6nHwfKtWv1OTRZuV6emi4UP4CquY5RbHccU7mMeDXawi7d4jw9G29g3nhrP8lSjr9BcxhEzXv0CdaKqJA==</latexit>

| yt , c)
<latexit sha1_base64="Kb+9m5lRalDo8sTB/SjNIFpjXOw=">AAAC8HichVHLahRRED1pX3F8ZNSN4GZwGImgQ7WIiqugLnQh5uEkgUwYujs3ySX9ovvOwNj0D+gHKLiQCC7UlVtduvEHXMQ/EJcR3Ljw9ANEg6aa7jp1bp3qqltu7OvUiOxMWAcOHjp8ZPJo49jxEyenmqdOL6bRMPFUz4v8KFl2nVT5OlQ9o42vluNEOYHrqyV363ZxvjRSSaqj8KEZx2o1cDZCva49x5AaNDvxdN8NsnE+yMxlO2/1A73WqhlzqURefnHQbEtXSmvtBXYN2qhtNmp+QR9riOBhiAAKIQyxDwcpnxXYEMTkVpGRS4h0ea6Qo0HtkFmKGQ7ZLX43GK3UbMi4qJmWao9/8fkmVLbQkc/yWnblk7yVr/Lzn7WyskbRy5jerbQqHkw9PrvwY19VQG+w+Vv1H4XL7P3zitkM1nGjnElzxrhkimm9Sj969HR34eZ8J7sgL+Ub59yWHfnIScPRd+/VnJp/zupF/TtUVXedEN2ve33AiopMERW3lTHzHneUE1W+wTXbfy91L1i80rWvdWXuanvmVr3wSZzDeUxzq9cxg7uYRY8dPME7vMcHK7GeWS+s7SrVmqg1Z/CHWW9+Ad7OrVc=</latexit>

tively denoises an input signal, and trains a deep neural network p(yt 1
yt
<latexit sha1_base64="feNpNBS1NA6fRrK97NgwcBF5KBw=">AAACl3ichVHNLsRQGD3qf/wNNsRGTIiNyVcRxIaQiKXBIDEyaeui0b+0dyYZzbyAB2Bh4SexEI9gaeMFLDyCWJLYWPjaaSIIvqa95557znd78umeZQaS6LFOqW9obGpuaU21tXd0dqW7e9YDt+QbIm+4lutv6logLNMReWlKS2x6vtBs3RIb+sFCdL5RFn5gus6arHhi29b2HHPXNDTJVKGg22GlWgzlmFotpjOUpbgGfwI1ARkkteymb1HADlwYKMGGgAPJ2IKGgJ8tqCB4zG0jZM5nZMbnAlWk2FtilWCFxuwBf/d4t5WwDu+jnkHsNvgWi1+fnYMYpge6phe6pxt6ovdfe4Vxj+hfKrzqNa/wil1Hfatv/7psXiX2P11/OHRW/6+LsknsYjrOZHJGL2aitEbNXz48eVmdWRkOR+iSnjnnBT3SHSd1yq/GVU6snCLFg1K/j+UnWB/PqpNZyk1k5uaTkbVgAEMY5blMYQ5LWEae7/VwjDOcK/3KrLKoLNWkSl3i6cWXUnIfWkyWYw==</latexit>

1 yt
<latexit sha1_base64="feNpNBS1NA6fRrK97NgwcBF5KBw=">AAACl3ichVHNLsRQGD3qf/wNNsRGTIiNyVcRxIaQiKXBIDEyaeui0b+0dyYZzbyAB2Bh4SexEI9gaeMFLDyCWJLYWPjaaSIIvqa95557znd78umeZQaS6LFOqW9obGpuaU21tXd0dqW7e9YDt+QbIm+4lutv6logLNMReWlKS2x6vtBs3RIb+sFCdL5RFn5gus6arHhi29b2HHPXNDTJVKGg22GlWgzlmFotpjOUpbgGfwI1ARkkteymb1HADlwYKMGGgAPJ2IKGgJ8tqCB4zG0jZM5nZMbnAlWk2FtilWCFxuwBf/d4t5WwDu+jnkHsNvgWi1+fnYMYpge6phe6pxt6ovdfe4Vxj+hfKrzqNa/wil1Hfatv/7psXiX2P11/OHRW/6+LsknsYjrOZHJGL2aitEbNXz48eVmdWRkOR+iSnjnnBT3SHSd1yq/GVU6snCLFg1K/j+UnWB/PqpNZyk1k5uaTkbVgAEMY5blMYQ5LWEae7/VwjDOcK/3KrLKoLNWkSl3i6cWXUnIfWkyWYw==</latexit>

1
(DNN) for minimizing an adversarial loss calculated from interme- | yt , c) F G(Id F)
<latexit sha1_base64="Kb+9m5lRalDo8sTB/SjNIFpjXOw=">AAAC8HichVHLahRRED1pX3F8ZNSN4GZwGImgQ7WIiqugLnQh5uEkgUwYujs3ySX9ovvOwNj0D+gHKLiQCC7UlVtduvEHXMQ/EJcR3Ljw9ANEg6aa7jp1bp3qqltu7OvUiOxMWAcOHjp8ZPJo49jxEyenmqdOL6bRMPFUz4v8KFl2nVT5OlQ9o42vluNEOYHrqyV363ZxvjRSSaqj8KEZx2o1cDZCva49x5AaNDvxdN8NsnE+yMxlO2/1A73WqhlzqURefnHQbEtXSmvtBXYN2qhtNmp+QR9riOBhiAAKIQyxDwcpnxXYEMTkVpGRS4h0ea6Qo0HtkFmKGQ7ZLX43GK3UbMi4qJmWao9/8fkmVLbQkc/yWnblk7yVr/Lzn7WyskbRy5jerbQqHkw9PrvwY19VQG+w+Vv1H4XL7P3zitkM1nGjnElzxrhkimm9Sj969HR34eZ8J7sgL+Ub59yWHfnIScPRd+/VnJp/zupF/TtUVXedEN2ve33AiopMERW3lTHzHneUE1W+wTXbfy91L1i80rWvdWXuanvmVr3wSZzDeUxzq9cxg7uYRY8dPME7vMcHK7GeWS+s7SrVmqg1Z/CHWW9+Ad7OrVc=</latexit>
<latexit sha1_base64="+802x5u8mV5scpnnoIFSdVyQLts=">AAAClnichVHLSsNQFBzjq9ZX1Y3gRiyKq3IqouJCRPGxrNWqUEWSeNsG8yK5LdTiD7gXF4Ki4EL8BJdu/AEX/QRxqeDGhSdpQFTUE5I7d+7MSSZHc03Dl0T1JqW5pbWtPdYR7+zq7ulN9PVv+k7Z00VOd0zH29ZUX5iGLXLSkKbYdj2hWpoptrSDxeB8qyI833DsDVl1xa6lFm2jYOiqZCq/Y6mypKtmbfloL5GkFIU1/BOkI5BEVBkncYcd7MOBjjIsCNiQjE2o8PnKIw2Cy9wuasx5jIzwXOAIcfaWWSVYoTJ7wM8i7/IRa/M+6OmHbp3fYvLtsXMYo/RIN/RCD3RLT/T+a69a2CP4liqvWsMr3L3e48H1t39dFq8SpU/XHw6N1f/rgmwSBcyEmQzO6IZMkFZv+CuHpy/rs9nR2hhd0TPnvKQ63XNSu/KqX6+J7BniPKj097H8BJsTqfRUitYmk/ML0chiGMIIxnku05jHKjLIhX/2BOe4UAaVOWVJWWlIlabIM4AvpWQ+AICdlgg=</latexit>

<latexit sha1_base64="OLYUr7p9HaNR3EPcARLNgI3OThY=">AAAC6XichVHLThRRED00PnB8MODGxMRMnGBg4aTGGCGsCBCVhZGHAyQMmXQ3F+hMv9J9ZxLszI4VO1YaXalxoS71D9z4Ay7wD4xLTNy48PQjGiRKdbrr1Ll1qqtuWaHrxFrkoM/oP3X6zNmBc6XzFy5eGiwPDS/HQSeyVcMO3CBatcxYuY6vGtrRrloNI2V6lqtWrPZMer7SVVHsBP4jvROqdc/c8p1NxzY1qVb5WtMz9bZtusm93mhzbqNys/Kbudsba5WrUpPMKsdBvQBVFDYflL+giQ0EsNGBBwUfmtiFiZjPGuoQhOTWkZCLiJzsXKGHErUdZilmmGTb/G4xWitYn3FaM87UNv/i8o2orGBEPssbOZRP8k6+ys9/1kqyGmkvO/RWrlVha3DvytKPE1Uevcb2H9V/FBazT85LZ9PYxEQ2k8MZw4xJp7Vzfffxk8OlycWR5Ia8lG+c84UcyEdO6ne/268X1OJzVk/rz1KV33VE9KDo9SErKjJplN5Wwsw57qhHlPsS11z/e6nHwfKtWv1OTRZuV6emi4UP4CquY5RbHccU7mMeDXawi7d4jw9G29g3nhrP8lSjr9BcxhEzXv0CdaKqJA==</latexit>

p(yt 1
diate outputs at all iterations. Subjective (side-by-side) listening tests y0
<latexit sha1_base64="nWeGbGNLJAMTdElJLWuIGd8Gu/I=">AAAClXichVHLSsNQFBzjq9ZX1YWCm2JRXJVTERVBKCriTqu2FR+UJF41mBdJWqihP+BacCEKCi7ET3Dpxh9w4SeISwU3LjxJA6KinpDcuXNnzs1wFFvXXI/osUFqbGpuaY21xds7Oru6Ez29BdcqO6rIq5ZuOeuK7ApdM0Xe0zxdrNuOkA1FF0XlYC44L1aE42qWueZVbbFtyHumtqupssfUxpZi+NVayadaKZGiNIWV/AkyEUghqmUrcYst7MCCijIMCJjwGOuQ4fKziQwINnPb8JlzGGnhuUANcfaWWSVYITN7wN893m1GrMn7oKcbulW+RefXYWcSw/RA1/RC93RDT/T+ay8/7BH8S5VXpe4Vdqn7aGD17V+XwauH/U/XHw6F1f/rgmwedjEVZtI4ox0yQVq17q8cnrysTq8M+yN0Sc+c84Ie6Y6TmpVX9SonVk4R50Flvo/lJyiMpTMTacqNp7Kz0chiGMQQRnkuk8hiEcvI870mjnGGc6lfmpHmpYW6VGqIPH34UtLSB6kFla0=</latexit>

Output y0
<latexit sha1_base64="nWeGbGNLJAMTdElJLWuIGd8Gu/I=">AAAClXichVHLSsNQFBzjq9ZX1YWCm2JRXJVTERVBKCriTqu2FR+UJF41mBdJWqihP+BacCEKCi7ET3Dpxh9w4SeISwU3LjxJA6KinpDcuXNnzs1wFFvXXI/osUFqbGpuaY21xds7Oru6Ez29BdcqO6rIq5ZuOeuK7ApdM0Xe0zxdrNuOkA1FF0XlYC44L1aE42qWueZVbbFtyHumtqupssfUxpZi+NVayadaKZGiNIWV/AkyEUghqmUrcYst7MCCijIMCJjwGOuQ4fKziQwINnPb8JlzGGnhuUANcfaWWSVYITN7wN893m1GrMn7oKcbulW+RefXYWcSw/RA1/RC93RDT/T+ay8/7BH8S5VXpe4Vdqn7aGD17V+XwauH/U/XHw6F1f/rgmwedjEVZtI4ox0yQVq17q8cnrysTq8M+yN0Sc+c84Ie6Y6TmpVX9SonVk4R50Flvo/lJyiMpTMTacqNp7Kz0chiGMQQRnkuk8hiEcvI870mjnGGc6lfmpHmpYW6VGqIPH34UtLSB6kFla0=</latexit>

Output y0
<latexit sha1_base64="nWeGbGNLJAMTdElJLWuIGd8Gu/I=">AAAClXichVHLSsNQFBzjq9ZX1YWCm2JRXJVTERVBKCriTqu2FR+UJF41mBdJWqihP+BacCEKCi7ET3Dpxh9w4SeISwU3LjxJA6KinpDcuXNnzs1wFFvXXI/osUFqbGpuaY21xds7Oru6Ez29BdcqO6rIq5ZuOeuK7ApdM0Xe0zxdrNuOkA1FF0XlYC44L1aE42qWueZVbbFtyHumtqupssfUxpZi+NVayadaKZGiNIWV/AkyEUghqmUrcYst7MCCijIMCJjwGOuQ4fKziQwINnPb8JlzGGnhuUANcfaWWSVYITN7wN893m1GrMn7oKcbulW+RefXYWcSw/RA1/RC93RDT/T+ay8/7BH8S5VXpe4Vdqn7aGD17V+XwauH/U/XHw6F1f/rgmwedjEVZtI4ox0yQVq17q8cnrysTq8M+yN0Sc+c84Ie6Y6TmpVX9SonVk4R50Flvo/lJyiMpTMTacqNp7Kz0chiGMQQRnkuk8hiEcvI870mjnGGc6lfmpHmpYW6VGqIPH34UtLSB6kFla0=</latexit>

Output
showed no statistically significant differences in naturalness between
human natural speech and those synthesized by WaveFit with five it-
erations. Furthermore, the inference speed of WaveFit was more
than 240 times faster than WaveRNN. Audio demos are available at
google.github.io/df-conformer/wavefit/. Fig. 1. Overview of (a) DDPM, (b) GAN-based model, and (c) pro-
Index Terms— Neural vocoder, fixed-point iteration, generative posed WaveFit. (a) DDPM is an iterative-style model, where sam-
adversarial networks, denoising diffusion probabilistic models. pling from the posterior is realized by adding noise to the denoised
intermediate signals. (b) GAN-based models predict y0 by a non-
iterative DNN F which is trained to minimize an adversarial loss
1. INTRODUCTION calculated from y0 and the target speech x0 . (c) Proposed WaveFit
is an iterative-style model without adding noise at each iteration, and
Neural vocoders [1–4] are artificial neural networks that generate a F is trained to minimize an adversarial loss calculated from all inter-
speech waveform given acoustic features. They are indispensable mediate signals yT −1 , . . . , y0 , where Id and G denote the identity
building blocks of recent applications of speech generation. For operator and a gain adjustment operator, respectively.
example, they are used as the backbone module in text-to-speech
(TTS) [5–10], voice conversion [11, 12], speech-to-speech trans-
Since a DDPM-based neural vocoder iteratively refines speech
lation (S2ST) [13–15], speech enhancement (SE) [16–19], speech
waveform, there is a trade-off between its sound quality and compu-
restoration [20, 21], and speech coding [22–25]. Autoregressive
tational cost [42], i.e., tens of iterations are required to achieve high-
(AR) models first revolutionized the quality of speech genera-
fidelity speech waveform. To reduce the number of iterations while
tion [1, 26–28]. However, as they require a large number of sequen-
maintaining the quality, existing studies of DDPMs have investigated
tial operations for generation, parallelizing the computation is not
the inference noise schedule [44], the use of adaptive prior [45, 46],
trivial thus their processing time is sometimes far longer than the
the network architecture [47, 48], and/or the training strategy [49].
duration of the output signals.
However, generating a speech waveform with quality comparable to
To speed up the inference, non-AR models have gained a lot human natural speech in a few iterations is still challenging.
of attention thanks to their parallelization-friendly model architec-
tures. Early successful studies of non-AR models are those based Recent studies demonstrated that the essence of DDPMs and
on normalizing flows [3, 4, 29] which convert an input noise to a GANs can coexist [50, 51]. Denoising diffusion GANs [50] use a
speech using stacked invertible deep neural networks (DNNs) [30]. generator to predict a clean sample from a diffused one and a dis-
In the last few years, the approach using generative adversarial net- criminator is used to differentiate the diffused samples from the clean
works (GANs) [31] is the most successful non-AR strategy [32–41] or predicted ones. This strategy was applied to TTS, especially to
where they are trained to generate speech waveforms indistinguish- predict a log-mel spectrogram given an input text [51]. As DDPMs
able from human natural speech by discriminator networks. The lat- and GANs can be combined in several different ways, there will be a
est member of the generative models for neural vocoders is the de- new combination which is able to achieve the high quality synthesis
noising diffusion probabilistic model (DDPM) [42–49], which con- with a small number of iterations.
verts a random noise into a speech waveform by the iterative sam- This study proposes WaveFit, an iterative-style non-AR neural
pling process as illustrated in Fig. 1 (a). With hundreds of iterations, vocoder, trained using a GAN-based loss as illustrated in Fig. 1 (c).
DDPMs can generate speech waveforms comparable to those of AR It is inspired by the theory of fixed-point iteration [52]. The pro-
models [42, 43]. posed model iteratively applies a DNN as a denoising mapping that
removes noise components from an input signal so that the out- 2.1.1. Prior adaptation from conditioning log-mel spectrogram
put becomes closer to the target speech. We use a loss that com-
bines a GAN-based [34] and a short-time Fourier transform (STFT)- To reduce the number of iterations in inference, PriorGrad [45] and
based [35] loss as this is insensitive to imperceptible phase differ- SpecGrad [46] introduced an adaptive prior N (0, Σ), where Σ is
ences. By combining the loss for all iterations, the intermediate out- computed from c. The use of an adaptive prior decreases the lower
put signals are encouraged to approach the target speech along with bound of the ELBO, and accelerates both training and inference [45].
the iterations. Subjective listening tests showed that WaveFit can SpecGrad [46] uses the fact that Σ is positive semi-definite and
generate a speech waveform whose quality was better than conven- that it can be decomposed as Σ = LL> where L ∈ RD×D and
>
tional DDPM models. The experiments also showed that the audio is the transpose. Then, sampling from N (0, Σ) can be written
quality of synthetic speech by WaveFit with five iterations is compa- as  = L˜  using ˜ ∼ N (0, I), and Eq. (2) with an adaptive prior
rable to those of WaveRNN [27] and human natural speech. becomes
2
LSG = L−1 ( − Fθ (xt , c, βt )) 2
. (4)
2. NON-AUTOREGRESSIVE NEURAL VOCODERS
SpecGrad [46] defines L = G+ M G and approximates L−1 ≈
A neural vocoder generates a speech waveform y0 ∈ R given a D
G+ M −1 G. Here N K × D matrix G represents the STFT, M =
log-mel spectrogram c = (c1 , ..., cK ) ∈ RF K , where ck ∈ RF diag[(m1,1 , . . . , mN,K )] ∈ CN K×N K is the diagonal matrix repre-
is an F -point log-mel spectrum at k-th time frame, and K is the senting the filter coefficients for each (n, k)-th time-frequency (T-F)
number of time frames. The goal is to develop a neural vocoder so bin, and G+ is the matrix representation of the inverse STFT (iSTFT)
as to generate y0 indistinguishable from the target speech x0 ∈ RD using a dual window. This means L and L−1 are implemented as
with less computations. This section briefly reviews two types of time-varying filters and its approximated inverse filters in the T-F
neural vocoders: DDPM-based and GAN-based ones. domain, respectively. The T-F domain filter M is obtained by the
spectral envelope calculated from c with minimum phase response.
2.1. DDPM-based neural vocoder The spectral envelope is obtained by applying the 24th order lifter to
the power spectrogram calculated from c.
A DDPM-based neural vocoder is a latent variable model of x0 as
q(x0 | c) based on a T -step Markov chain of xt ∈ RD with learned 2.1.2. InferGrad
Gaussian transitions, starting from q(xT ) = N (0, I), defined as
In conventional DDPM-models, since the DNNs have been trained
T
Z Y as a Gaussian denoiser using a simplified loss function as Eq. (2),
q(x0 | c) = q(xT ) q(xt−1 | xt , c) dx1 · · · dxT . (1) there is no guarantee that the generated speech becomes close to the
RDT t=1 target speech. To solve this problem, InferGrad [49] synthesizes y0
By modeling q(xt−1 | xt , c), y0 ∼ q(x0 | c) can be realized as a from a random signal  via Eq. (3) in every training step, then addi-
recursive sampling of yt−1 from q(yt−1 | yt , c). tionally minimizes an infer loss LIF which represents a gap between
In a DDPM-based neural vocoder, xt is generated by the dif- generated speech y0 and the target speech x0 . The loss function for
fusion process that gradually adds Gaussian noise to the wave- InferGrad is given as
form according√to a noise schedule  {β1 , ..., βT } given by p(xt | LIG = LWG + λIF LIF , (5)
xt−1 ) = N 1 − βt xt−1 , βt I . This formulation enables us
√ sample x
to √t at an arbitrary timestep t in a closed form Q as xt = where λIF > 0 is a tunable weight for the infer loss.
ᾱt x0 + 1 − ᾱt , where αt = 1 − βt , ᾱt = ts=1 αs , and
 ∼ N (0, I). As proposed by Ho et al. [53], DDPM-based neural 2.2. GAN-based neural vocoder
vocoders use a DNN F with parameter θ for predicting  from xt as
Another popular approach for non-AR neural vocoders is to adopt
ˆ = Fθ (xt , c, βt ). The DNN F can be trained by maximizing the
adversarial training; a neural vocoder is trained to generate a speech
evidence lower bound (ELBO), though most of DDPM-based neu-
waveform where discriminators cannot distinguish it from the target
ral vocoders use a simplified loss function which omits loss weights
speech, and discriminators are trained to differentiate between target
corresponding to iteration t;
and generated speech. In GAN-based models, a non-AR DNN F :
LWG = k − Fθ (xt , c, βt )k22 , (2) RF K → RD directly outputs y0 from c as y0 = Fθ (c).
One main research topic with GAN-based models is to design
where k·kp denotes the `p norm. Then, if βt is small enough, loss functions. Recent models often use multiple discriminators at
q(xt−1 | xt , c) can be given by N (µt , γt I), and the recursive sam- multiple resolutions [34]. One of the pioneering work of using mul-
pling from q(yt−1 | yt , c) can be realized by iterating the following tiple discriminators is MelGAN [34] which proposed the multi-scale
formula for t = T, . . . , 1 as discriminator (MSD). In addition, MelGAN uses a feature match-
  ing loss that minimizes the mean-absolute-error (MAE) between the
1 βt discriminator feature maps of target and generated speech. The loss
yt−1 = √ yt − √ Fθ (yt , c, βt ) + γt  (3)
αt 1 − ᾱt functions of the generator LGAN GAN
Gen and discriminator LDis of the GAN-
based neural vocoder are given as followings:
1−ᾱ
where γt = 1−ᾱt−1 βt , yT ∼ N (0, I) and γ1 = 0.
t RGAN
The first DDPM-based neural vocoders [42, 43] required over 1 X
LGAN
Gen = −Dr (y0 ) + λFM LFM
r (x0 , y0 ) (6)
200 iterations to match AR neural vocoders [26, 27] in naturalness RGAN r=1
measured by mean opinion score (MOS). To reduce the number of
RGAN
iterations while maintaining the quality, existing studies have inves- 1 X
tigated the use of noise prior distributions [45, 46] and/or better in- LGAN
Dis = max(0, 1−Dr (x0 ))+ max(0, 1+Dr (y0 )) (7)
RGAN r=1
ference noise schedules [44].
where RGAN is the number of discriminators and λFM ≥ 0 is a tunable An example of fixed-point iteration is the following proximal
weight for LFM . The r-th discriminator Dr : RD → R consists of H point algorithm which is a generalization of iterative refinement [63]:
sub-layers as Dr = DrH ◦ · · · ◦ Dr1 where Drh : RDh−1,r → RDh,r .
Then, the feature matching loss for the r-th discriminator is given by ξn+1 = proxL (ξn ), (13)
H−1
1 X 1 where proxL denotes the proximity operator of a loss function L,
LFM
r (x0 , y0 ) = kdhx,0 − dhy,0 k1 , (8)
H −1 Dh,r
h=1 h 1 i
proxL (ξ) ∈ arg min L(ζ) + kξ − ζk22 . (14)
ζ 2
where dha,b is the outputs of Drh−1 (ab ).
As an auxiliary loss function, a multi-resolution STFT loss is of- If L is proper lower-semicontinuous convex, then proxL is firmly
ten used to stabilize the adversarial training process [35]. A popular (quasi-) nonexpansive, and hence a sequence generated by the
multi-resolution STFT loss LMR-STFT consists of the spectral conver- proximal point algorithm converges to a point in Fix(proxL ) =
gence loss and the magnitude loss as [35, 36, 54]: arg minζ L(ζ), i.e., Eq. (13) minimizes the loss function L. Note
RSTFT
that Eq. (14) is a negative log-likelihood of maximum a posteriori
1 X estimation based on the Gaussian observation model with a prior
LMR-STFT (x0 , y0 ) = LSc Mag
r (x0 , y0 ) + Lr (x0 , y0 ), (9)
RSTFT proportional to exp(−L(·)). That is, Eq. (13) is an iterative Gaus-
r=1
sian denoising algorithm like the DDPM-based methods, which mo-
where RSTFT is the number of STFT configurations. LSc r and
tivates us to consider the fixed-point theory.
LMag
r correspond to the spectral convergence loss and the mag- The important property of a firmly quasi-nonexpansive map-
nitude loss of the r-th STFT configuration as LSc r (x0 , y0 ) =
ping is that it is attracting, i.e., equality in Eq. (11) never occurs:
kX0,r − Y0,r k2 /kX0,r k2 and LMag 1
r (x0 , y0 ) = Nr Kr kln(X0,r ) −
kT (ξ) − φk2 < kξ − φk2 . Hence, applying T always moves an
ln(Y0,r )k1 , where Nr and Kr are the numbers of frequency bins input signal ξ closer to a fixed point φ. In this paper, we consider
and time-frames of the r-th STFT configuration, respectively, and a denoising mapping as T that removes noise from an input signal,
X0,r ∈ RNr Kr and Y0,r ∈ RNr Kr correspond to the amplitude and let us consider an ideal situation. In this case, the fixed-point
spectrograms with the r-th STFT configuration of x0 and y0 . iteration in Eq. (12) is an iterative denoising algorithm, and the at-
State-of-the-art GAN-based neural vocoders [33] can achieve a tracting property ensures that each iteration always refines the sig-
quality nearly on a par with human natural speech. Recent stud- nal. It converges to a clean signal φ that does not contain any noise
ies showed that the essence of GANs can be incorporated into because a fixed point of the denoising mapping, T (φ) = φ, is a
DDPMs [50, 51]. As DDPMs and GANs can be combined in sev- signal that cannot be denoised anymore, i.e., no noise is remained.
eral different ways, there will be a new combination which is able to If we can construct such a denoising mapping specialized to speech
achieve the high quality synthesis with a small number of iterations. signals, then iterating Eq. (12) from any signal (including random
noise) gives a clean speech signal, which realizes a new principle of
neural vocoders.
3. FIXED-POINT ITERATION

Extensive contributions to data science have been made by fixed- 4. PROPOSED METHOD
point theory and algorithms [52]. These ideas have recently been
combined with DNN to design data-driven iterative algorithms [55– This section introduces the proposed iterative-style non-AR neural
60]. Our proposed method is also inspired by them, and hence fixed- vocoder, WaveFit. Inspired by the success of a combination of the
point theory is briefly reviewed in this section. fixed-point theory and deep learning in image processing [60], we
A fixed point of a mapping T is a point φ that is unchanged by adopt a similar idea for speech generation. As mentioned in the last
T , i.e., T (φ) = φ. The set of all fixed points of T is denoted as paragraph in Sec. 3, the key idea is to construct a DNN as a denois-
ing mapping satisfying Eq. (11). We propose a loss function which
φ ∈ RD | T (φ) = φ . approximately imposes this property in the training. Note that the

Fix(T ) = (10)
notations from Sec. 2 (e.g., x0 and yT ) are used in this section.
Let the mapping T be firmly quasi-nonexpansive [61], i.e., it satisfies
4.1. Model overview
kT (ξ) − φk2 ≤ kξ − φk2 (11)
The proposed model iteratively applies a denoising mapping to refine
for every ξ ∈ RD and every φ ∈ Fix(T ) (6= ∅), and there ex- yt so that yt−1 is closer to x0 . By iterating the following procedure
ists a quasi-nonexpansive mapping F that satisfies T = 21 Id + 12 F, T times, WaveFit generates a speech signal y0 :
where Id denotes the identity operator. Then, for any initial point,
the following fixed-point iteration converges to a fixed point of T :1 yt−1 = G (zt , c) , zt = yt − Fθ (yt , c, t) (15)

ξn+1 = T (ξn ). (12) where Fθ : RD → RD is a DNN trained to estimate noise com-


ponents, yT ∼ N (0, Σ), and Σ is given by the initializer of Spec-
That is, by iterating Eq. (12) from an initial point ξ0 , we can find a Grad [46]. G (z, c) : RD → RD is a gain adjustment operator that
fixed point of T depending on the choice of ξ0 . adjusts the signal power of zt to that of the target signal defined by
c. Specifically, the target power Pc is calculated from the power
1 To be precise, Id −T must be demiclosed at 0. Note that we showed the
spectrogram calculated from c. Then, the power of zt is calculated
fixed-point iteration in a very limited form, Eq. (12), because it is sufficient as Pz , and the gain of zt is adjusted as yt = (Pc /(Pz +s)) zt where
for explaining our motivation. For more general theory, see, e.g., [60–62]. s = 10−8 is a scalar to avoid zero-division.
4.2. Loss function We propose WaveFit because of the following two observations.
First, addition of random noise at each iteration disturbs the direc-
WaveFit can obtain clean speech from random noise yT by tion that a DDPM-based vocoder should proceed. The intermediate
fixed-point iteration if the mapping G(Id −Fθ ) is a firmly quasi- signals randomly changes their phase, which results in some artifact
nonexpansive mapping as described in Sec. 3. Although it is dif- in the higher frequency range due to phase distortion. Second, a
ficult to guarantee a DNN-based function to be a firmly quasi- DDPM-based vocoder without noise addition generates notable arti-
nonexpansive mapping in general, we design the loss function to facts that are not random, for example, a sine-wave-like artifact. This
approximately impose this property. fact indicates that a trained mapping of a DDPM-based vocoder does
The most important property of a firmly quasi-nonexpansive not move an input signal toward the target speech signal; it only fo-
mapping is that an output signal yt−1 is always closer to x0 cuses on randomness. Therefore, DDPM-based approaches have a
than the input signal yt . To impose this property on the denois- fundamental limitation on reducing the number of iterations.
ing mapping, we combine loss values for all intermediate outputs In contrast, WaveFit denoises an intermediate signal without
y0 , y1 , . . . , yT −1 as follows: adding random noise. Furthermore, the training strategy realized by
T −1 Eq. (16) faces the direction of denoising at each iteration toward the
1 X WF target speech. Hence, WaveFit can steadily improve the sound qual-
LWaveFit = L (x0 , yt ). (16)
T t=0 ity by the iterative denoising. These properties of WaveFit allow us
to reduce the number of iterations while maintaining sound quality.
The loss function LWF is designed based on the following two de- Note that, although the computational cost of one iteration of
mands: (i) the output waveform must be a high-fidelity signal; and our WaveFit model is almost identical to that of DDPM-based mod-
(ii) the loss function should be insensitive to imperceptible phase els, training WaveFit requires more computations than DDPM-based
difference. The reason for second demand is as follow: the fixed- and GAN-based models. This is because the loss function in Eq. (16)
points of a DNN corresponding to a conditioning log-mel spectro- consists of a GAN-based loss function for all intermediate outputs.
gram possibly include multiple waveforms, because there are count- Obviously, computing a GAN-based loss function, e.g. Eq. (6), re-
less waveforms corresponding to the conditioning log-mel spectro- quires larger computational costs than the mean-squared-error used
gram due to the difference in the initial phase. Therefore, phase in DDPM-based models as in Eq (2). In addition, WaveFit computes
sensitive loss functions, such as the squared-error, are not suitable a GAN-based loss function T times. Designing a less expensive loss
as LWF . Thus, we use a loss that combines GAN-based and multi- function that stably trains WaveFit models should be a future work.
resolution STFT loss functions as this is insensitive to imperceptible
phase differences: 4.4. Implementation

LWF (x0 , yt ) = LGAN


Gen (x0 , yt ) + λSTFT L
STFT
(x0 , yt ), (17) Network architecture: We use “WaveGrad Base model [42]” for
F , which has 13.8M trainable parameters. To compute the initial
where λSTFT ≥ 0 is a tunable weight. noise yT ∼ N (0, Σ), we follow the noise generation algorithm of
As the GAN-based loss function LGAN Gen , a combination of multi- SpecGrad [46].
resolution discriminators and feature-matching losses are adopted. For each Dr , we use the same architecture of that of Mel-
We slightly modified Eq. (6) to use a hinge loss as in SEANet [64]: GAN [34]. RGAN = 3 structurally identical discriminators are
applied to input audio at different resolutions (original, 2x down-
R
X sampled, and 4x down-sampled ones). Note that the number of logits
LGAN
Gen (x0 , yt ) = max(0, 1 − Dr (yt )) + λFM LFM
r (x0 , yt ). (18) in the output of Dr is more than one and proportional to the length
r=1
of the input. Thus, the averages of Eqs. (6) and (7) are used as loss
functions for generator and discriminator, respectively.
For the discriminator loss function, LGAN
Dis (x0 , yt ) in Eq. (7) is calcu-
lated and averaged over all intermediate outputs y0 , . . . , yT −1 . Hyper parameters: We assume that all input signals are up- or
For LSTFT , we use LMR-STFT because several studies showed that down-sampled to 24 kHz. For c, we use F = 128-dimensional
this loss function can stabilize the adversarial training [35, 36, 54]. log-mel spectrograms, where the lower and upper frequency bound
In addition, we use MAE loss between amplitude mel-spectrograms of triangular mel-filterbanks are 20 Hz and 12 kHz, respectively. We
of the target and generated speech as used in HiFi-GAN [33] and use the following STFT configurations for mel-spectrogram compu-
VoiceFixer [20]. Thus, LSTFT is given by tation and the initial noise generation algorithm; 50 ms Hann win-
dow, 12.5 ms frame shift, and 2048-point FFT, respectively.
1 For LMR-STFT , we use RSTFT = 3 resolutions, as well as conven-
LSTFT (x0 , yt ) = LMR-STFT (x0 , yt ) + kX0Mel − YtMel k1 , (19)
FK tional studies [35, 36, 54]. The Hann window size, frame shift, and
FFT points of each resolution are [360, 900, 1800], [80, 150, 300],
where X0Mel ∈ RF K and YtMel ∈ RF K denotes the amplitude mel- and [512, 1024, 2048], respectively. For the MAE loss of amplitude
spectrograms of x0 and yt , respectively. mel-spectrograms, we extract a 128-dimensional mel-spectrogram
with the second STFT configuration.
4.3. Differences between WaveFit and DDPM-based vocoders
Here, we discuss some important differences between the conven- 5. EXPERIMENT
tional and proposed neural vocoders. The conceptual difference is
obvious; DDPM-based models are derived using probability the- We evaluated the performance of WaveFit via subjective listening
ory, while WaveFit is inspired by fixed-point theory. Since fixed- experiments. In the following experiments, we call a WaveFit with
point theory is a key tool for deterministic analysis of optimization T iteration as “WaveFit-T ”. We used SpecGrad [46] and Infer-
and adaptive filter algorithms [65], WaveFit can be considered as an Grad [49] as baselines of the DDPM-based neural vocoders, Multi-
optimization-based or adaptive-filter-based neural vocoder. band (MB)-MelGAN [36] and HiFi-GAN V 1 [33] as baselines of the
GAN-based ones, and WaveRNN [27] as a baseline of the AR one.
Since the output quality of GAN-based models is highly affected by
hyper-parameters, we used well-tuned open source implementations
of MB-MelGAN and HiFi-GAN published in [66]. Audio demos are
available in our demo page.2

5.1. Experimental settings

Datasets: We trained the WaveFit, DDPM-based and AR-based


baselines with a proprietary speech dataset which consisted of 184
hours of high quality US English speech spoken by 11 female and
10 male speakers at 24 kHz sampling. For subjective tests, we used
Fig. 2. Log-magnitude absolute error (left) and spectral convergence
1,000 held out samples from the same proprietary speech dataset.
(right) of all intermediate outputs of SpecGrad-5, InferGrad-5, and
To compare WaveFit with GAN-based baselines, the LibriTTS
WaveFit-5. Lower is better for both metrics. Solid lines mean the
dataset [67] was used. We trained a WaveFit-5 model from the com-
average, and colored area denote the standard-deviation. The scale
bination of the “train-clean-100” and “train-clean-360” subsets at 24
of y-axis is log-scale.
kHz sampling. For subjective tests, we used randomly selected 1,000
samples from the “test-clean-100” subset. Synthesized speech wave-
forms for the “test-clean-100” subset published at [66] were used as For SpecGrad [46], we used WG-50 and WG-3 noise schedules
synthetic speech samples for the GAN-based baselines. The file- for training and inference, respectively. We used the same setups as
name list used in the listening test is available in our demo page.2 [46] for other settings except the removal of the generalized energy
Model and training setup: We trained all models using 128 Google distance (GED) loss [69] as we observed no impact on the quality.
TPU v3 cores with a global batch size of 512. To accelerate training, We tested InferGrad [49] with two, three and five iterations. The
we randomly picked up 120 frames (1.5 seconds, D = 36, 000 sam- noise generation process, loss function, and noise schedules were
ples) as input. We trained WaveFit-2 and WaveFit-3 for 500k steps, modified from the original paper [49] as we could not achieve the
and WaveFit-5 for 250k steps with the optimizer setting same as that reasonable quality with the setup from the paper. We used the noise
of WaveGrad [42]. The details of the baseline models are described generation process of SpecGrad [46]. For loss function, we used LSG
in Sec. 5.2. instead of LWG , and used the WaveFit loss LWF (x0 , y0 ) described in
For the proprietary dataset, weights of each loss were λFM = 100 Eq. (17) as the infer loss LIF . Furthermore, according to [49], the
and λSTFT = 1 based on the hyper-parameters of SEANet [64]. For weight for λIF was 0.1. We used the WG-50 noise schedule [46] for
the LibriTTS, based on MelGAN [34] and MB-MelGAN [36] set- training. For inference with each iteration, we used the noise sched-
tings, we used λFM = 10 and λSTFT = 2.5, and excluded the MAE ule with [3. E -04, 9. E -01], [3. E -04, 6. E -02, 9. E -01], and [1.0 E -
loss between amplitude mel-spectrograms of the target and gener- 04, 2.1 E -03, 2.8 E -02, 3.5 E -01, 7.0 E -01], respectively, because
ated waveforms. the output quality was better than the schedules used in the original
Metrics: To evaluate subjective quality, we rated speech naturalness paper [49]. As described in [49], we initialized the InferGrad model
through MOS and side-by-side (SxS) preference tests. The scale of from a pre-trained checkpoint of SpecGrad (1M steps) then finetuned
MOS was a 5-point scale (1: Bad, 2: Poor, 3: Fair, 4: Good, 5: Ex- it for additional 250k steps. The learning rate was 5.0 × 10−5 .
cellent) with rating increments of 0.5, and that of SxS was a 7-point GAN-based models: We used “checkpoint-1000000steps” from
scale (-3 to 3). Subjects were asked to rate the naturalness of each “libritts multi band melgan.v2” for MB-MelGAN [36], and
stimulus after listening to it. Test stimuli were randomly chosen and “checkpoint-2500000steps” from “libritts hifigan.v1” for HiFi-
presented to subjects in isolation, i.e., each stimulus was evaluated GAN [33], respectively. These samples were stored in Google Drive
by one subject. Each subject was allowed to evaluate up to six stim- linked from [66] (June 17th, 2022, downloaded).
uli. The subjects were paid native English speakers in the United
States. They were requested to use headphones in a quiet room. 5.3. Verification experiments for intermediate outputs
In addition, we measured real-time factor (RTF) on an NVIDIA We first verified whether the intermediate outputs of WaveFit-5 were
V100 GPU. We generated 120k time-domain samples (5 seconds approaching to the target speech or not. We evaluated the spectral
waveform) 20 times, and evaluated the average RTF with 95 % con- convergence LSc Mag
r and the log-magnitude absolute error Lr for all in-
fidence interval. termediate outputs. The number of STFT resolutions was RSTFT = 3.
For each resolution, we used a different STFT configuration from
5.2. Details of baseline models
the loss function used in InferGrad and WaveFit: the Hann win-
WaveRNN [27]: The model consisted of a single long short- dow size, frame shift, and FFT points of each resolution were [240,
term memory layer with 1,024 hidden units, 5 convolutional lay- 480, 1,200], [48, 120, 240], and [512, 1024, 2048], respectively.
ers with 512 channels as the conditioning stack to process the mel- SpecGrad-5 and InferGrad-5 were also evaluated for comparison.
spectrogram features, and a 10-component mixture of logistic distri- Figure 2 shows the experimental results. We can see that (i) both
butions as its output layer. It had 18.2M trainable parameters. We metrics for WaveFit decay on each iteration, and (ii) WaveFit outputs
trained this model using the Adam optimizer [68] for 1M steps. The almost converge at three iterations. In both metrics, WaveFit-5 was
learning rate was linearly increased to 10−4 in the first 100 steps better than SpecGrad-5. Although the objective scores of WaveFit-
then exponentially decayed to 10−6 from 200k to 400k steps. 5 and InferGrad-5 were almost the same, we found the outputs of
DDPM-based models: For both models, we used the same network InferGrad-5 included small but noticeable artifacts. A possible rea-
architecture, optimizer and training settings with WaveFit. son for these artifacts is that the DDPM-based models with a few
iterations at inference needs to perform a large-level noise reduction
2 google.github.io/df-conformer/wavefit/ at each iteration. This doesn’t satisfy the small βt assumption of
Table 1. Real time factors (RTFs) and MOSs with their 95% confi- Table 3. Results of MOS and SxS tests on the LibriTTS dataset with
dence intervals. Ground-truth means human natural speech. their 95% confidence intervals. A positive SxS score indicates that
Method MOS (↑) RTF (↓) WaveFit-5 was preferred.
Method MOS (↑) SxS p-value
InferGrad-2 3.68 ± 0.07 0.030 ± 0.00008
WaveFit-2 4.13 ± 0.67 0.028 ± 0.0001 MB-MelGAN 3.37 ± 0.085 0.619 ± 0.087 0.0000
HiFi-GAN V 1 4.03 ± 0.070 0.023 ± 0.057 0.2995
SpecGrad-3 3.36 ± 0.08 0.046 ± 0.0018
Ground-truth 4.18 ± 0.067 −0.089 ± 0.052 0.0000
InferGrad-3 4.03 ± 0.07 0.045 ± 0.0004
WaveFit-3 4.33 ± 0.06 0.041 ± 0.0001 WaveFit-5 3.98 ± 0.072 − −
InferGrad-5 4.37 ± 0.06 0.072 ± 0.0001
WaveRNN 4.41 ± 0.05 17.3 ± 0.495 RTF of WaveFit-5 was over 240 times faster than that of WaveRNN,
WaveFit-5 4.44 ± 0.05 0.070 ± 0.0020 and (ii) the naturalness of WaveFit with 5 iterations is comparable to
Ground-truth 4.50 ± 0.05 − WaveRNN and human natural speech, whereas early DDPM-based
models [42, 43] required over 100 iterations to achieve such quality.

Table 2. Side-by-side test results with their 95% confidence inter- 5.5. Comparison with GAN-based models
vals. A positive score indicates that Method-A was preferred.
The MOS and SxS results using the LibriTTS 1,000 evaluation sam-
Method-A Method-B SxS p-value ples are shown in Table 3. These results show that WaveFit-5 was
WaveFit-3 InferGrad-3 0.375 ± 0.073 0.0000 significantly better than MB-MelGAN, and there was no significant
WaveFit-3 WaveRNN −0.051 ± 0.044 0.0027 difference in naturalness between WaveFit-5 and HiFi-GAN V 1. In
terms of the model complexity, the RTF and model size of WaveFit-
WaveFit-5 InferGrad-5 0.063 ± 0.050 0.0012 5 are 0.07 and 13.8M, respectively, which are comparable to those
WaveFit-5 WaveRNN −0.018 ± 0.044 0.2924 of HiFi-GAN V 1 reported in the original paper [33], 0.065 and
WaveFit-5 Ground-truth −0.027 ± 0.037 0.0568 13.92M, respectively. These results indicate that WaveFit-5 is com-
parable in the model complexity and naturalness with the well-tuned
HiFi-GAN V 1 model on LibriTTS dataset.
DDPM, which is required to make q(xt−1 | xt , c) Gaussian [53,70].
We found that some outputs from WaveFit-5 were contaminated
In contrast, WaveFit denoises an intermediate signal without adding
by pulsive artifacts. When we trained WaveFit using clean dataset
random noise. Therefore, noise reduction level at each iteration can
recorded in an anechoic chamber (dataset used in the experiments
be small. This characteristics allow WaveFit to achieve higher audio
of Sec. 5.4), such artifacts were not observed. In contrast, the
quality with less iterations. We provide intermediate output exam-
target waveform used in this experiment was not totally clean but
ples of these models in our demo page.2
contained some noise, which resulted in the erroneous output sam-
ples. This result indicates that WaveFit models might not be robust
5.4. Comparison with WaveRNN and DDPM-based models against noise and reverberation in the training dataset. We used the
The MOS and RTF results and the SxS results using the 1,000 eval- SpecGrad architecture from [46] both for WaveFit and DDPM-based
uation samples are shown in Tables 1 and 2, respectively. In all three models because we considered that the DDPM-based models are di-
iteration models, WaveFit produced better quality than both Spec- rect competitor of WaveFit and that using the same architecture pro-
Grad [46] and InferGrad [49]. As InferGrad used the same network vides a fair comparison. After we realized the superiority of WaveFit
architecture and adversarial loss as WaveFit, the main differences over the other DDPM-based models, we performed comparison with
between them are (i) whether to compute loss value for all interme- GAN-based models, and hence the model architecture of WaveFit in
diate outputs, and (ii) whether to add random noise at each iteration. the current version is not so sophisticated compared to GAN-based
These MOS and SxS results indicate that fixed-point iteration is a models. Indeed, WaveFit-1 is significantly worse than GAN-based
better strategy than DDPM for iterative-style neural vocoders. models, which can be heard form our audio demo.2 There is a lot of
InferGrad-3 was significantly better than SpecGrad-3. The dif- room for improvement in the performance and robustness of Wave-
ference between InferGrad-3 and SpecGrad-3 is the use of λIF only. Fit by seeking a proper architecture for F, which is left as a future
This result suggests that the hypothesis in Sec. 4.3, the mapping in work.
DDPM-based neural vocoders only focuses on removing random
components in input signals rather than moving the input signals 6. CONCLUSION
towards the target, is supported. Therefore, incorporating the differ-
ence between generated and target speech into the loss function of This paper proposed WaveFit, which integrates the essence of GANs
iterative-style neural vocoders is a promising approach. into a DDPM-like iterative framework based on fixed-point iteration.
On the RTF comparison, WaveFit models were slightly faster WaveFit iteratively denoises an input signal like DDPMs while not
than SpecGrad and InferGrad models with the same number of iter- adding random noise at each iteration. This strategy was realized by
ations. This is because DDPM-based models need to sample a noise training a DNN using a loss inspired by the concept of the fixed-point
waveform at each iteration, whereas WaveFit requires it only at the theory. The subjective listening experiments showed that WaveFit
first iteration. can generate a speech waveform whose quality is better than con-
Although WaveFit-3 was worse than WaveRNN, WaveFit-5 ventional DDPM models. We also showed that the quality achieved
achieved the naturalness comparable to WaveRNN and human nat- by WaveFit with five iterations was comparable to WaveRNN and
ural speech; there were no significant differences in the SxS tests human natural speech, while its inference speed was more than 240
with α = 0.01. We would like to highlight that (i) the inference times faster than WaveRNN.
7. REFERENCES [18] J. Su, Z. Jin, and A. Finkelstein, “HiFi-GAN: High-fidelity de-
noising and dereverberation based on speech deep features in
[1] S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, and adversarial networks,” in Proc. Interspeech, 2020.
J. Sotelo, “SampleRNN: An unconditional end-to-end neural
[19] ——, “HiFi-GAN-2: Studio-quality speech enhancement via
audio,” in Proc. ICLR, 2018.
generative adversarial networks conditioned on acoustic fea-
[2] A. Tamamori, T. Hayashi, K. Kobayashi, K. Takeda, and tures,” in Proc. IEEE WASPAA, 2021.
T. Toda, “Speaker-dependent WaveNet vocoder.” in Proc. In-
[20] H. Liu, Q. Kong, Q. Tian, Y. Zhao, D. L. Wang, C. Huang, and
terspeech, 2017.
Y. Wang, “VoiceFixer: Toward general speech restoration with
[3] R. Prenger, R. Valle, and B. Catanzaro, “WaveGlow: A neural vocoder,” arXiv:2109.13731, 2021.
flow-based generative network for speech synthesis,” in Proc.
[21] T. Saeki, S. Takamichi, T. Nakamura, N. Tanji, and
ICASSP, 2019.
H. Saruwatari, “SelfRemaster: Self-supervised speech restora-
[4] W. Ping, K. Peng, K. Zhao, and Z. Song, “WaveFlow: A com- tion with analysis-by-synthesis approach using channel model-
pact flow-based model for raw audio,” in Proc. ICML, 2020. ing,” in Proc. Interspeech, 2022.
[5] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, [22] W. B. Kleijn, F. S. C. Lim, A. Luebs, J. Skoglund, F. Stimberg,
Z. Chen, Y. Zhang, Y. Wang, R. Skerrv-Ryan, R. A. Saurous, Q. Wang, and T. C. Walters, “WaveNet based low rate speech
Y. Agiomvrgiannakis, and Y. Wu, “Natural TTS synthesis by coding,” in Proc. ICASSP, 2018.
conditioning WaveNet on mel spectrogram predictions,” in
[23] T. Yoshimura, K. Hashimoto, K. Oura, Y. Nankaku, and
Proc. ICASSP, 2018.
K. Tokuda, “WaveNet-based zero-delay lossless speech cod-
[6] J. Shen, Y. Jia, M. Chrzanowski, Y. Zhang, I. Elias, H. Zen, and ing,” in Proc. SLT, 2018.
Y. Wu, “Non-attentive Tacotron: Robust and controllable neu-
[24] J.-M. Valin and J. Skoglund, “A real-time wideband neural
ral TTS synthesis including unsupervised duration modeling,”
vocoder at 1.6kb/s using LPCNet,” in Proc. Interspeech, 2019.
arXiv:2010.04301, 2020.
[25] N. Zeghidour, A. Luebs, A. Omran, J. Skoglund, and
[7] I. Elias, H. Zen, J. Shen, Y. Zhang, Y. Jia, R. J. Weiss, and
M. Tagliasacchi, “SoundStream: An end-to-end neural audio
Y. Wu, “Parallel Tacotron: Non-autoregressive and control-
codec,” IEEE/ACM Trans. Audio, Speech and Lang. Proc.,
lable TTS,” in Proc. ICASSP, 2021.
2022.
[8] Y. Jia, H. Zen, J. Shen, Y. Zhang, and Y. Wu, “PnG BERT:
[26] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan,
Augmented BERT on phonemes and graphemes for neural
O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and
TTS,” in Proc. Interspeech, 2021.
K. Kavukcuoglu, “WaveNet: A generative model for raw au-
[9] Y. Ren, Y. Ruan, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. dio,” arXiv:1609.03499, 2016.
Liu, “FastSpeech: Fast, robust and controllable text to speech,”
[27] N. Kalchbrenner, W. Elsen, K. Simonyan, S. Noury,
in Proc. NeurIPS, 2019.
N. Casagrande, W. Lockhart, F. Stimberg, A. van den Oord,
[10] Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. S. Dieleman, and K. Kavukcuoglu, “Efficient neural audio syn-
Liu, “FastSpeech 2: Fast and high-quality end-to-end text to thesis,” in Proc. ICML, 2018.
speech,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2021.
[28] J.-M. Valin and J. Skoglund, “LPCNet: Improving neural
[11] B. Sisman, J. Yamagishi, S. King, and H. Li, “An overview speech synthesis through linear prediction,” in Proc. ICASSP,
of voice conversion and its challenges: From statistical mod- 2019.
eling to deep learning,” IEEE/ACM Trans. Audio Speech Lang.
[29] A. van den Oord, Y. Li, I. Babuschkin, K. Simonyan,
Process., 2021.
O. Vinyals, K. Kavukcuoglu, G. van den Driessche, E. Lock-
[12] W.-C. Huang, S.-W. Yang, T. Hayashi, and T. Toda, “A com- hart, L. C. Cobo, F. Stimberg, N. Casagrande, D. Grewe,
parative study of self-supervised speech representation based S. Noury, S. Dieleman, E. Elsen, N. Kalchbrenner, H. Zen,
voice conversion,” EEE J. Sel. Top. Signal Process., 2022. A. Graves, H. King, T. Walters, D. Belov, and D. Hassabis,
[13] Y. Jia, R. J. Weiss, F. Biadsy, W. Macherey, M. Johnson, “Parallel WaveNet: Fast high-fidelity speech synthesis.” in
Z. Chen, and Y. Wu, “Direct speech-to-speech translation with Proc. ICML, 2018.
a sequence-to-sequence model,” in Proc. Interspeech, 2019. [30] D. J. Rezende and S. Mohamed, “Variational inference with
[14] Y. Jia, M. T. Ramanovich, T. Remez, and R. Pomerantz, normalizing flows,” in Proc. ICML, 2015.
“Translatotron 2: High-quality direct speech-to-speech trans- [31] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-
lation with voice preservation,” in Proc. ICML, 2022. Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative ad-
[15] A. Lee, P.-J. Chen, C. Wang, J. Gu, S. Popuri, X. Ma, versarial nets,” in Proc. NeurIPS, 2014.
A. Polyak, Y. Adi, Q. He, Y. Tang, J. Pino, and W.-N. Hsu, “Di- [32] C. Donahue, J. McAuley, and M. Puckette, “Adversarial audio
rect speech-to-speech translation with discrete units,” in Proc. synthesis,” in Proc. ICLR, 2019.
60th Annu. Meet. Assoc. Comput. Linguist. (Vol. 1: Long Pap.),
[33] J. Kong, J. Kim, and J. Bae, “HiFi-GAN: Generative adversar-
2022.
ial networks for efficient and high fidelity speech synthesis,” in
[16] S. Maiti and M. I. Mandel, “Parametric resynthesis with neural Proc. NeurIPS, 2020.
vocoders,” in Proc. IEEE WASPAA, 2019.
[34] K. Kumar, R. Kumar, T. de Boissiere, L. Gestin, W. Z. Teoh,
[17] ——, “Speaker independence of neural vocoders and their ef- J. Sotelo, A. de Brébisson, Y. Bengio, and A. C. Courville,
fect on parametric resynthesis speech enhancement,” in Proc. “MelGAN: Generative adversarial networks for conditional
ICASSP, 2020. waveform synthesis,” in Proc. Adv. Neural Inf. Process. Syst.
(NeurIPS), 2019.
[35] R. Yamamoto, E. Song, and J.-M. Kim, “Parallel WaveGAN: [52] P. L. Combettes and J.-C. Pesquet, “Fixed point strategies in
A fast waveform generation model based on generative adver- data science,” IEEE Trans. Signal Process., 2021.
sarial networks with multi-resolution spectrogram,” in Proc. [53] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilis-
ICASSP, 2020. tic models,” in Proc. NeurIPS, 2020.
[36] G. Yang, S. Yang, K. Liu, P. Fang, W. Chen, and L. Xie, “Multi- [54] A. Defossez, G. Synnaeve, and Y. Adi, “Real time speech en-
band MelGAN: Faster waveform generation for high-quality hancement in the waveform domain,” in Proc. Interspeech,
text-to-speech,” in Proc. IEEE SLT, 2021. 2020.
[37] J. You, D. Kim, G. Nam, G. Hwang, and G. Chae, “GAN [55] G. T. Buzzard, S. H. Chan, S. Sreehari, and C. A. Bouman,
vocoder: Multi-resolution discriminator is all you need,” “Plug-and-play unplugged: Optimization-free reconstruction
arXiv:2103.05236, 2021. using consensus equilibrium,” SIAM J. Imaging Sci., 2018.
[38] W. Jang, D. Lim, J. Yoon, B. Kim, and J. Kim, “UnivNet: [56] E. Ryu, J. Liu, S. Wang, X. Chen, Z. Wang, and W. Yin, “Plug-
A neural vocoder with multi-resolution spectrogram discrim- and-play methods provably converge with properly trained de-
inators for high-fidelity waveform generation,” in Proc. Inter- noisers,” in Proc. ICML, 2019.
speech, 2021.
[57] J.-C. Pesquet, A. Repetti, M. Terris, and Y. Wiaux, “Learning
[39] T. Kaneko, K. Tanaka, H. Kameoka, and S. Seki, “iSTFTNet: maximally monotone operators for image recovery,” SIAM J.
Fast and lightweight mel-spectrogram vocoder incorporating Imaging Sci., 2021.
inverse short-time Fourier transform,” in Proc. ICASSP, 2022.
[58] Y. Masuyama, K. Yatabe, Y. Koizumi, Y. Oikawa, and
[40] T. Bak, J. Lee, H. Bae, J. Yang, J.-S. Bae, and Y.-S. Joo,
N. Harada, “Deep Griffin-Lim iteration,” in Proc. ICASSP,
“Avocodo: Generative adversarial network for artifact-free
2019.
vocoder,” arXiv:2206.13404, 2022.
[59] ——, “Deep Griffin–Lim iteration: Trainable iterative phase
[41] S.-g. Lee, W. Ping, B. Ginsburg, B. Catanzaro, and S. Yoon,
reconstruction using neural network,” IEEE J. Sel. Top. Signal
“BigVGAN: A universal neural vocoder with large-scale train-
Process., 2021.
ing,” arXiv:2206.04658, 2022.
[42] N. Chen, Y. Zhang, H. Zen, R. J. Weiss, M. Norouzi, and [60] R. Cohen, M. Elad, and P. Milanfar, “Regularization by denois-
W. Chan, “WaveGrad: Estimating gradients for waveform gen- ing via fixed-point projection (RED-PRO),” SIAM J. Imaging
eration,” in Proc. ICLR, 2021. Sci., 2021.

[43] Z. Kong, W. Ping, J. Huang, K. Zhao, and B. Catanzaro, “Diff- [61] H. H. Bauschke and P. L. Combettes, Convex Analysis and
Wave: A versatile diffusion model for audio synthesis,” in Monotone Operator Theory in Hilbert Spaces. Springer, 2017.
Proc. ICLR, 2021. [62] I. Yamada, M. Yukawa, and M. Yamagishi, Minimizing the
[44] M. W. Y. Lam, J. Wang, D. Su, and D. Yu, “BDDM: Bilateral Moreau envelope of nonsmooth convex functions over the fixed
denoising diffusion models for fast and high-quality speech point set of certain quasi-nonexpansive mappings. Springer,
synthesis,” in Proc. ICLR, 2022. 2011, pp. 345–390.
[45] S. Lee, H. Kim, C. Shin, X. Tan, C. Liu, Q. Meng, T. Qin, [63] N. Parikh and S. Boyd, “Proximal algorithms,” Found. Trends
W. Chen, S. Yoon, and T.-Y. Liu, “PriorGrad: Improving con- Optim., 2014.
ditional denoising diffusion models with data-dependent adap- [64] M. Tagliasacchi, Y. Li, K. Misiunas, and D. Roblek, “SEANet:
tive prior,” in Proc. ICLR, 2022. A multi-modal speech enhancement network,” in Proc. Inter-
[46] Y. Koizumi, H. Zen, K. Yatabe, N. Chen, and M. Bacchi- speech, 2020.
ani, “SpecGrad: Diffusion probabilistic model based neural [65] S. Theodoridis, K. Slavakis, and I. Yamada, “Adaptive learning
vocoder with adaptive noise spectral shaping,” in Proc. Inter- in a world of projections,” IEEE Signal Process. Mag., 2011.
speech, 2022.
[66] T. Hayashi, “Parallel WaveGAN implementation with Py-
[47] T. Okamoto, T. Toda, Y. Shiga, and H. Kawai, “Noise level torch,” github.com/kan-bayashi/ParallelWaveGAN.
limited sub-modeling for diffusion probabilistic vocoders,” in
Proc. ICASSP, 2021. [67] H. Zen, R. Clark, R. J. Weiss, V. Dang, Y. Jia, Y. Wu, Y. Zhang,
and Z. Chen, “LibriTTS: A corpus derived from LibriSpeech
[48] K. Goel, A. Gu, C. Donahue, and C. Ré, “It’s Raw! audio for text-to-speech,” in Proc. Interspeech, 2019.
generation with state-space models,” arXiv:2202.09729, 2022.
[68] D. P. Kingma and J. L. Ba, “Adam: A method for stochastic
[49] Z. Chen, X. Tan, K. Wang, S. Pan, D. Mandic, L. He, and optimization,” in Proc. ICLR, 2015.
S. Zhao, “InferGrad: Improving diffusion models for vocoder
by considering inference in training,” in Proc. ICASSP, 2022. [69] A. A. Gritsenko, T. Salimans, R. van den Berg, J. Snoek,
and N. Kalchbrenner, “A spectral energy distance for parallel
[50] Z. Xiao, K. Kreis, and A. Vahdat, “Tackling the generative speech synthesis,” in Proc. NeurIPS, 2020.
learning trilemma with denoising diffusion GANs,” in Proc.
ICLR, 2022. [70] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Gan-
guli, “Deep unsupervised learning using nonequilibrium ther-
[51] S. Liu, D. Su, and D. Yu, “DiffGAN-TTS: High-fidelity
modynamic,” in Proc. ICML, 2015.
and efficient text-to-speech with denoising diffusion GANs,”
arXiv:2201.11972, 2022.

You might also like