A Voice Cloning Method Based on the Improved HiFi-GAN Model

<table class="table-group" id="tab6"><tr><td><table class="table">
<tr><td class="thead-hr" colspan="5"><hr/></td></tr>
<tr class="thead"><td class="align_left">Metric</td><td class="align_center">Settings</td><td class="align_center">LibriSpeech</td><td class="align_center">VCTK</td><td class="align_center">THchs-30</td></tr>
<tr><td class="thead-hr" colspan="5"><hr/></td></tr>
<tr><td class="align_left" rowspan="8">MOS (CI)</td><td class="align_center">Multispeaker TTS</td><td class="align_center">3.93 ± 0.06</td><td class="align_center">3.57 ± 0.07</td><td class="align_center">3.64 ± 0.05</td></tr>
<tr><td class="align_center">Multispeaker TTS + <i>x</i>-vector</td><td class="align_center">4.02 ± 0.08</td><td class="align_center">3.72 ± 0.09</td><td class="align_center">3.78 ± 0.07</td></tr>
<tr><td class="align_center">WaveGlow + <i>d</i>-vector</td><td class="align_center">3.85 ± 0.06</td><td class="align_center">3.49 ± 0.08</td><td class="align_center">3.47 ± 0.06</td></tr>
<tr><td class="align_center">WaveGlow + <i>x</i>-vector</td><td class="align_center">3.93 ± 0.07</td><td class="align_center">3.74 ± 0.08</td><td class="align_center">3.69 ± 0.08</td></tr>
<tr><td class="align_center">HiFi-GAN + <i>d</i>-vector</td><td class="align_center">4.21 ± 0.10</td><td class="align_center">3.86 ± 0.06</td><td class="align_center">3.92 ± 0.07</td></tr>
<tr><td class="align_center">HiFi-GAN + <i>x</i>-vector</td><td class="align_center">4.30 ± 0.07</td><td class="align_center">4.15 ± 0.07</td><td class="align_center">4.13 ± 0.09</td></tr>
<tr><td class="align_center">Improved HiFi-GAN + <i>d</i>-vector</td><td class="align_center">4.28 ± 0.09</td><td class="align_center">4.06 ± 0.05</td><td class="align_center">4.11 ± 0.04</td></tr>
<tr><td class="align_center">Improved HiFi-GAN + <i>x</i>-vector</td><td class="align_center">4.36 ± 0.06</td><td class="align_center">4.28 ± 0.08</td><td class="align_center">4.28 ± 0.06</td></tr>
<tr class="table-tr"><td colspan="5"><hr class="tbody-hr"/></td></tr>
</table></td></tr></table>

<div>Mean opinion score (MOS), with confidence intervals, for the naturalness of cloned speech produced by different models.</div>
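Each cell in the table reports a mean score together with an interval half-width. The source does not say how the intervals were computed, so the following is only a minimal sketch of one common approach: a normal-approximation interval around the mean of listener ratings. The function name `mos_with_ci`, the 95% level (`z = 1.96`), and the sample ratings are all assumptions made for illustration and are not taken from the paper.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Return (mean, half_width) for a list of listener ratings.

    z = 1.96 gives a 95% normal-approximation interval. This level is an
    assumption; the table does not state which confidence level was used.
    """
    n = len(ratings)
    if n < 2:
        raise ValueError("need at least two ratings to estimate spread")
    mean = statistics.fmean(ratings)
    sd = statistics.stdev(ratings)       # sample standard deviation (n - 1)
    half_width = z * sd / math.sqrt(n)   # z times the standard error of the mean
    return mean, half_width


# Hypothetical 1-5 ratings for one system; NOT data from the paper.
ratings = [4, 5, 4, 4, 3, 5, 4, 4, 5, 4]
mean, half_width = mos_with_ci(ratings)
print(f"{mean:.2f} ± {half_width:.2f}")
```

With few raters, a Student's t quantile in place of `z` would give a slightly wider and more conservative interval.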

Computational Intelligence and Neuroscience


Table 6
