When cloning a voice, it’s important to consider what the AI has been trained on: which languages and what type of dataset. In this case, the following model is available:
Multilingual v2 - Our latest model. It excels in stability, language diversity, and accuracy in replicating accents and voices, and it is fast for its size. Multilingual v2 supports 28 languages; the list below shows them with their regional variants:
- English (USA)
- English (UK)
- English (Australia)
- English (Canada)
- Japanese
- Chinese
- German
- Hindi
- French (France)
- French (Canada)
- Korean
- Portuguese (Brazil)
- Portuguese (Portugal)
- Italian
- Spanish (Spain)
- Spanish (Mexico)
- Indonesian
- Dutch
- Turkish
- Filipino
- Polish
- Swedish
- Bulgarian
- Romanian
- Arabic (Saudi Arabia)
- Arabic (UAE)
- Czech
- Greek
- Finnish
- Croatian
- Malay
- Slovak
- Danish
- Tamil
- Ukrainian
The training dataset is quite varied, especially for Multilingual v2, but it consists mainly
of audiobook-type material.
As mentioned earlier, if the voice you try to clone falls outside of these parameters or outside
of what the AI has heard during training, it might have a hard time replicating the voice
perfectly using instant voice cloning.
How the audio was recorded is more important than the total length (total runtime) of the
samples. The number of samples you use doesn’t matter in itself; what counts is their total
combined length (total runtime).
Approximately 1-2 minutes of clear audio without any reverb, artifacts, or background noise
of any kind appears to be the sweet spot. When we speak of “audio or recording quality,” we
do not mean the codec, such as MP3 or WAV; we mean how the audio was captured.
However, regarding audio codecs, using MP3 at 128 kbps and above seems to work just
fine, and higher bitrates don’t seem to markedly improve the quality of the clone.
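To make the "total combined length" idea concrete, here is a minimal Python sketch that sums the runtime of several samples and compares it with the roughly 1-2 minute range mentioned above. The function names, the 60 and 120 second defaults, and the verdict strings are illustrative assumptions, not part of any official tool, and the duration helper only reads uncompressed WAV files through the standard library.

```python
import wave


def wav_duration_seconds(path):
    """Return the runtime of one uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def assess_total_runtime(durations, low=60.0, high=120.0):
    """Sum per-sample durations (seconds) and compare with a target range.

    The defaults mirror the approximate 1-2 minute guidance in the text;
    they are assumptions for this sketch, not hard limits.
    """
    total = sum(durations)
    if total < low:
        verdict = "too short"
    elif total > high:
        verdict = "longer than needed"
    else:
        verdict = "in range"
    return total, verdict
```

For example, three clips of 30, 45, and 20 seconds add up to 95 seconds, so `assess_total_runtime([30, 45, 20])` reports the set as in range no matter how many files it is split across.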
The AI will attempt to mimic everything it hears in the audio: the speed of the person talking,
the inflections, the accent and tonality, the breathing pattern and strength, and also
mouth clicks, noise, and artifacts, which can confuse it.
Another important thing to keep in mind is that the AI will try to replicate the performance of
the voice you provide. If you talk in a slow, monotone voice without much emotion, that is
what the AI will mimic. On the other hand, if you talk quickly with much emotion, that is what
the AI will try to replicate.
It is crucial that the voice remains consistent throughout all the samples, not only in tone but
also in performance. If there is too much variance, it might confuse the AI, leading to more
varied output between generations.
- The most important aspect to get a proper clone is the voice itself, the language and
accent, and the quality of the recording.
- Audio length is less important than quality but still plays an important role up to a
certain point. At a minimum, input audio should be 1 minute long. Avoid adding
beyond 3 minutes; this will yield little improvement and can, in some cases, even be
detrimental to the clone, making it more unstable.
- Keep the audio consistent. Ensure that the voice maintains a consistent tone
throughout, with a consistent performance. Also, make sure that the audio quality of
the voice remains consistent across all the samples. Even if you only use a single
sample, ensure that it remains consistent throughout the full sample. Feeding the AI
audio that is very dynamic, meaning wide fluctuations in pitch and volume, will yield
less predictable results.
- Find a good balance for the volume so the audio is neither too quiet nor too loud. The
ideal would be between -23 dB and -18 dB RMS with a true peak of -3 dB.
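The volume guidance can be checked with a few lines of Python. The sketch below is a rough illustration under stated assumptions: samples are floats normalized to the range -1.0 to 1.0, levels are expressed relative to a full scale of 1.0, and the peak is the plain sample peak. A real true-peak measurement oversamples the signal and can read slightly higher, so treat the peak result as an approximation. The function names are hypothetical.

```python
import math


def rms_db(samples):
    """RMS level in dB relative to full scale (1.0). Assumes floats in [-1, 1]."""
    if not samples:
        raise ValueError("no samples given")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return float("-inf")  # digital silence
    return 10 * math.log10(mean_square)


def peak_db(samples):
    """Sample peak in dB relative to full scale (an approximation of true peak)."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return float("-inf")
    return 20 * math.log10(peak)


def check_levels(samples, rms_range=(-23.0, -18.0), peak_ceiling=-3.0):
    """Compare measured levels with the targets quoted in the text."""
    rms = rms_db(samples)
    peak = peak_db(samples)
    return {
        "rms_db": rms,
        "peak_db": peak,
        "rms_ok": rms_range[0] <= rms <= rms_range[1],
        "peak_ok": peak <= peak_ceiling,
    }


# Usage: one second of a 440 Hz tone at amplitude 0.15, sampled at 16 kHz.
tone = [0.15 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
report = check_levels(tone)
```

If `rms_ok` is false because the recording is too quiet or too loud, adjusting the gain before uploading is usually simpler than re-recording, provided the peak stays under the ceiling afterwards.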
If you are unsure about what is permissible from a legal standpoint, please consult the Terms
of Service for more information.