A Voice Cloning Method Based on the Improved HiFi-GAN Model

<table class="table-group" id="tab7"><tr><td><table class="table">
<tr><td class="thead-hr" colspan="5"><hr/></td></tr>
<tr class="thead"><td class="align_left">Metric</td><td class="align_center">Settings</td><td class="align_center">LibriSpeech</td><td class="align_center">VCTK</td><td class="align_center">THchs-30</td></tr>
<tr><td class="thead-hr" colspan="5"><hr/></td></tr>
<tr><td class="align_left" rowspan="8">SMOS (CI)</td><td class="align_center">Multispeaker TTS</td><td class="align_center">3.56 ± 0.07</td><td class="align_center">3.18 ± 0.06</td><td class="align_center">3.25 ± 0.08</td></tr>
<tr><td class="align_center">Multispeaker TTS + <i>x</i>-vector</td><td class="align_center">3.91 ± 0.06</td><td class="align_center">3.44 ± 0.07</td><td class="align_center">3.59 ± 0.06</td></tr>
<tr><td class="align_center">WaveGlow + <i>d</i>-vector</td><td class="align_center">3.55 ± 0.09</td><td class="align_center">3.11 ± 0.09</td><td class="align_center">3.32 ± 0.07</td></tr>
<tr><td class="align_center">WaveGlow + <i>x</i>-vector</td><td class="align_center">3.89 ± 0.08</td><td class="align_center">3.47 ± 0.09</td><td class="align_center">3.64 ± 0.05</td></tr>
<tr><td class="align_center">HiFi-GAN + <i>d</i>-vector</td><td class="align_center">3.82 ± 0.05</td><td class="align_center">3.38 ± 0.07</td><td class="align_center">3.43 ± 0.09</td></tr>
<tr><td class="align_center">HiFi-GAN + <i>x</i>-vector</td><td class="align_center">4.15 ± 0.07</td><td class="align_center">3.61 ± 0.08</td><td class="align_center">3.68 ± 0.08</td></tr>
<tr><td class="align_center">Improved HiFi-GAN + <i>d</i>-vector</td><td class="align_center">3.99 ± 0.10</td><td class="align_center">3.52 ± 0.06</td><td class="align_center">3.61 ± 0.05</td></tr>
<tr><td class="align_center">Improved HiFi-GAN + <i>x</i>-vector</td><td class="align_center">4.23 ± 0.06</td><td class="align_center">3.80 ± 0.08</td><td class="align_center">3.84 ± 0.07</td></tr>
<tr class="table-tr"><td colspan="5"><hr class="tbody-hr"/></td></tr>
</table></td></tr></table>

<div>Similarity mean opinion scores (SMOS, with confidence intervals) of cloned speech for different models.</div>
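Each table entry has the form "mean ± half-width". As a minimal sketch of how such an entry could be derived from listener ratings, the Python below computes a mean opinion score and a normal-approximation confidence interval. The 95% level (z = 1.96), the function name `mos_with_ci`, and the sample ratings are all assumptions made for illustration; the table does not state the confidence level or the procedure actually used.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Return (mean, half_width) for a list of opinion scores.

    Uses a normal approximation; z=1.96 corresponds to an assumed 95% level.
    """
    n = len(ratings)
    if n < 2:
        raise ValueError("need at least two ratings")
    mean = statistics.fmean(ratings)
    # Standard error of the mean from the sample standard deviation.
    sem = statistics.stdev(ratings) / math.sqrt(n)
    return mean, z * sem


# Hypothetical 1-5 similarity ratings, NOT data from the paper.
ratings = [4, 5, 4, 3, 4, 5, 4, 4, 3, 5]
mean, half_width = mos_with_ci(ratings)
print(f"{mean:.2f} ± {half_width:.2f}")
```

With few raters, a Student's t quantile in place of the fixed z value would give a slightly wider and more conservative interval.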

Computational Intelligence and Neuroscience


Table 7
