Voice Creation

When cloning a voice, it’s important to consider what the AI has been trained on: which languages and what type of dataset. In this case, the following model is available:

Multilingual v2 - Our latest model. It excels in stability, language diversity, and accuracy in replicating accents and voices, and it is remarkably fast for its size. Multilingual v2 supports 28 languages, listed here with their regional variants:

  • English (USA)
  • English (UK)
  • English (Australia)
  • English (Canada)
  • Japanese
  • Chinese
  • German
  • Hindi
  • French (France)
  • French (Canada)
  • Korean
  • Portuguese (Brazil)
  • Portuguese (Portugal)
  • Italian
  • Spanish (Spain)
  • Spanish (Mexico)
  • Indonesian
  • Dutch
  • Turkish
  • Filipino
  • Polish
  • Swedish
  • Bulgarian
  • Romanian
  • Arabic (Saudi Arabia)
  • Arabic (UAE)
  • Czech
  • Greek
  • Finnish
  • Croatian
  • Malay
  • Slovak
  • Danish
  • Tamil
  • Ukrainian

The training dataset is quite varied, especially for Multilingual v2, but it consists mainly
of audiobook-type material.

As mentioned earlier, if the voice you try to clone falls outside of these parameters or outside
of what the AI has heard during training, it might have a hard time replicating the voice
perfectly using instant voice cloning.

How the audio was recorded is more important than the total length (total runtime) of the
samples. The number of samples you use doesn’t matter; what counts is their total combined
runtime.

Approximately 1-2 minutes of clear audio without any reverb, artifacts, or background noise
appears to be the sweet spot. By “audio or recording quality” we do not mean the codec, such
as MP3 or WAV; we mean how the audio was captured. As for codecs, MP3 at 128 kbps and above
seems to work just fine, and higher bitrates don’t seem to markedly improve the quality of
the clone.

The AI will attempt to mimic everything it hears in the audio: the speed of the person talking,
the inflections, the accent and tonality, the breathing pattern and strength, and also mouth
clicks, noise, and artifacts, which can confuse it.
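
The runtime guidance above can be checked before uploading. The following is a minimal sketch, assuming the samples are uncompressed PCM WAV files that Python's standard `wave` module can read (MP3 files would need a third-party decoder). The function names and the "below" / "within" / "above" labels are illustrative and are not part of any official tooling.

```python
import wave

# Sweet spot taken from the guidance above: roughly 1-2 minutes in total.
SWEET_SPOT_MIN_SECONDS = 60
SWEET_SPOT_MAX_SECONDS = 120


def wav_duration_seconds(path):
    """Return the runtime of a single PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def total_runtime_seconds(paths):
    """Sum the runtime of all samples.

    The number of files does not matter; only the combined runtime does.
    """
    return sum(wav_duration_seconds(p) for p in paths)


def classify_runtime(seconds):
    """Label a total runtime relative to the 1-2 minute sweet spot."""
    if seconds < SWEET_SPOT_MIN_SECONDS:
        return "below"
    if seconds > SWEET_SPOT_MAX_SECONDS:
        return "above"
    return "within"
```

Calling `classify_runtime(total_runtime_seconds(["a.wav", "b.wav"]))` with your own file names gives a quick indication of whether the combined samples fall inside the suggested range. It says nothing about recording quality, which matters more.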

Another important thing to keep in mind is that the AI will try to replicate the performance of
the voice you provide. If you talk in a slow, monotone voice without much emotion, that is
what the AI will mimic. On the other hand, if you talk quickly with much emotion, that is what
the AI will try to replicate.

It is crucial that the voice remains consistent throughout all the samples, not only in tone but
also in performance. If there is too much variance, it might confuse the AI, leading to more
varied output between generations.

  • The most important factors for a good clone are the voice itself, the language and
    accent, and the quality of the recording.
  • Audio length is less important than quality but still plays an important role up to a
    certain point. At a minimum, input audio should be 1 minute long. Avoid going
    beyond 3 minutes; this will yield little improvement and can, in some cases, even be
    detrimental to the clone, making it more unstable.
  • Keep the audio consistent. Ensure that the voice maintains a consistent tone
    throughout, with a consistent performance. Also, make sure that the audio quality of
    the voice remains consistent across all the samples. Even if you only use a single
    sample, ensure that it remains consistent throughout the full sample. Feeding the AI
    audio that is very dynamic, meaning wide fluctuations in pitch and volume, will yield
    less predictable results.
  • Find a good balance for the volume so the audio is neither too quiet nor too loud. The
    ideal would be between -23 dB and -18 dB RMS with a true peak of -3 dB.
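
The volume targets in the last point can be measured with a few lines of code. This is a rough sketch under stated assumptions: the input is a 16-bit PCM WAV file, the dB figures are interpreted as dB relative to full scale, and the peak is the plain sample peak, which only approximates a true-peak measurement (a real true-peak meter oversamples the signal). All function names are hypothetical.

```python
import array
import math
import sys
import wave


def read_pcm16_samples(path):
    """Read a 16-bit PCM WAV file and return samples scaled to [-1.0, 1.0)."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        frames = wav.readframes(wav.getnframes())
    pcm = array.array("h")
    pcm.frombytes(frames)
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV sample data is little-endian
    return [s / 32768.0 for s in pcm]


def to_db(value):
    """Convert a linear amplitude to decibels; silence maps to -inf."""
    return 20.0 * math.log10(value) if value > 0 else float("-inf")


def rms_db(samples):
    """RMS level in dB relative to full scale (all channels pooled)."""
    if not samples:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    return to_db(math.sqrt(mean_square))


def peak_db(samples):
    """Sample peak in dB relative to full scale (approximates true peak)."""
    return to_db(max((abs(s) for s in samples), default=0.0))


def level_ok(samples):
    """Check the suggested range: RMS from -23 to -18 dB, peak at most -3 dB."""
    return -23.0 <= rms_db(samples) <= -18.0 and peak_db(samples) <= -3.0
```

For a file named `sample.wav` (a placeholder), `level_ok(read_pcm16_samples("sample.wav"))` reports whether the recording sits inside the suggested range. If it does not, adjust the gain in an audio editor and measure again.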

If you are unsure about what is permissible from a legal standpoint, please consult the Terms
of Service for more information.

