Instant and Professional Voice Cloning

Instant Voice Cloning (IVC) allows you to create voice clones from shorter samples instantly. Creating an instant voice clone does not train or create a custom AI model. Instead, it relies on prior knowledge from training data to make an educated guess rather than training on the exact voice. This works exceptionally well for a lot of voices.

However, the most significant limitation of IVC shows up when you try to clone a distinctive voice or accent that the AI may not have heard anything similar to during training.

In such cases, creating a custom model with explicit training using Professional Voice Cloning (PVC) might be the best option. PVC is only available per user request.

Unlike Instant Voice Cloning (IVC), which lets you clone voices from very short samples nearly instantaneously, Professional Voice Cloning (PVC) allows you to train a hyper-realistic model of a voice.

This is achieved by training a dedicated model on a large set of voice data to produce a model that is indistinguishable from the original voice.

Because the custom models require fine-tuning and training, Professional Voice Clones take longer to create than Instant Voice Clones. Giving an exact estimate is difficult, as it depends on the number of people in the queue before you and a few other factors.

Here are the current estimates for Professional Voice Cloning:

English: ~30 minutes up to 3 hours

Multilingual: ~3 hours up to 6 hours

Before you start uploading your samples, you should be mindful of a few things and take steps to ensure the best possible results.

Firstly, Professional Voice Cloning is highly accurate in cloning the samples used for its training. It will create a near-perfect clone of what it hears, including all the intricacies and characteristics of that voice as well as any artefacts and unwanted audio present in the samples. If you upload low-quality samples with background noise, room reverb/echo, or any other type of unwanted sound, such as music or multiple people speaking, the AI will also try to replicate all of these elements in the clone.

Secondly, make sure there is only a single speaking voice throughout the audio, as more than one speaker, excessive noise, or any of the above can confuse the AI. This confusion can result in the AI being unable to discern which voice to clone, or misinterpreting what the voice sounds like because it is masked by other sounds, leading to a less-than-optimal clone.

Thirdly, make sure you have enough material to clone the voice properly. The bare minimum we recommend is 30 minutes of audio, but for the optimal result and the most accurate clone, we recommend closer to 3 hours of audio. You might be able to get away with less, but at that point, we can’t vouch for the quality of the resulting clone.

Fourthly, the speaking style in the samples you provide will be replicated in the output, so depending on what delivery you are looking for, the training data should correspond to that style (e.g. if you are looking to voice an audiobook with a clone of your voice, the audio you submit for training should be a recording of you reading a book in the tone of voice you want to use). Including one style in the uploaded samples is better for consistency's sake.

Lastly, it’s best to use samples where you speak the language for which the PVC will mainly be used. Of course, the AI can speak any language that we currently support. However, it is worth noting that if the voice itself is not native to the language you want the AI to speak - meaning you cloned a voice speaking a different language - it might have an accent from the original language and mispronounce words and inflexions. For instance, if you clone a voice speaking English and then want it to speak Spanish, it will very likely have an English accent when speaking Spanish. We only support cloning samples recorded in one of our supported languages, and the application will reject your sample if it is recorded in an unsupported language.

For now, any Voice Air user with access to PVC can clone multiple voices, but only one voice will show on their account at a time. If you want to clone another voice, you must first delete the current PVC on your account. We will help you with this. You will be asked to go through a verification process before submitting your fine-tuning request.

Professional Recording Equipment: Use high-quality recording equipment for optimal results, as the AI will clone everything about the audio: high-quality input means high-quality output. Any microphone will work, but we recommend an XLR mic going into a dedicated audio interface. At the budget end, good options include an Audio-Technica AT2020 or a Rode NT1 going into a Focusrite interface or similar.

Use a Pop-Filter: Record with a pop filter in front of the microphone. This minimizes plosives in the recording.

Microphone Distance: Position yourself at the right distance from the microphone. It is recommended to be approximately two fists away from the mic, but it also depends on what type of recording you want.

Noise-Free Recording: Ensure the audio input is free of interference, such as background music or noise. AI cloning works best with clean, uncluttered audio.

Room Acoustics: Preferably, record in an acoustically treated room. This reduces unwanted echoes and background noise, giving the AI a clearer audio input. If you do not have a treated room, you can improvise one by dampening the recording space with a thick duvet or quilt.

Audio Pre-processing: Consider editing your audio beforehand if you are aiming for a specific sound in the AI's output. For instance, if you want a polished, podcast-like output, pre-process your audio to match that quality. If you have long pauses or many “uhm”s and “ahm”s between words, the AI will mimic those too.
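As a rough illustration of how long pauses could be located before editing, the sketch below scans a mono recording in short frames and reports stretches that stay below a loudness threshold. The function name find_long_pauses, the 20 ms frame size, the -45 dB silence threshold, and the 1-second minimum pause are all assumptions chosen for the example, not values recommended by this guide. Samples are assumed to be floats between -1.0 and 1.0.

```python
import math

def find_long_pauses(samples, sample_rate, frame_ms=20,
                     silence_db=-45.0, min_pause_s=1.0):
    """Return (start, end) times in seconds of quiet stretches.

    All thresholds are illustrative assumptions; tune them by ear.
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    n_frames = len(samples) // frame_len
    pauses = []
    pause_start = None
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        mean_square = sum(s * s for s in frame) / frame_len
        # Frame level in dB relative to full scale (1.0).
        level = 10 * math.log10(mean_square) if mean_square > 0 else float("-inf")
        t = i * frame_len / sample_rate
        if level < silence_db:
            if pause_start is None:
                pause_start = t  # a quiet stretch begins here
        else:
            if pause_start is not None and t - pause_start >= min_pause_s:
                pauses.append((pause_start, t))
            pause_start = None
    # Handle a quiet stretch that runs to the end of the recording.
    end = n_frames * frame_len / sample_rate
    if pause_start is not None and end - pause_start >= min_pause_s:
        pauses.append((pause_start, end))
    return pauses
```

The reported time ranges are only candidates for trimming. Listen to each one before cutting, since natural breathing room between sentences is part of a normal delivery.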

Volume Control: Maintain a consistent volume that is loud enough to be clear but not so loud that it causes distortion. The goal is to achieve a balanced and steady audio level. The ideal would be between -23dB and -18dB RMS with a true peak of -3dB.
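To make the level targets concrete, here is a minimal sketch of how RMS and peak levels could be measured on a recording. It assumes the dB figures above are relative to digital full scale (dBFS) and that samples are floats between -1.0 and 1.0. The function names are hypothetical, and the peak computed here is the highest sample value, which only approximates the peak a dedicated meter with oversampling would report.

```python
import math

def rms_dbfs(samples):
    """RMS level in dB relative to full scale (assumed reference: 1.0)."""
    if not samples:
        raise ValueError("no samples")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return float("-inf")
    return 10 * math.log10(mean_square)

def peak_dbfs(samples):
    """Highest absolute sample value in dB relative to full scale."""
    if not samples:
        raise ValueError("no samples")
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return float("-inf")
    return 20 * math.log10(peak)

def check_levels(samples, rms_range=(-23.0, -18.0), peak_ceiling=-3.0):
    """Compare measured levels against the targets given in this guide."""
    rms = rms_dbfs(samples)
    peak = peak_dbfs(samples)
    return {
        "rms_db": rms,
        "peak_db": peak,
        "rms_ok": rms_range[0] <= rms <= rms_range[1],
        "peak_ok": peak <= peak_ceiling,
    }
```

A quick way to try it is to decode a sample to floating-point values with your audio tool of choice and pass the resulting list to check_levels. If rms_ok is False, adjust the gain and re-record or re-export rather than relying on heavy normalisation afterwards.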

Sufficient Audio Length: Provide at least 30 minutes of high-quality audio that follows the above guidelines for best results, preferably closer to 3 hours of audio. The more quality data you can feed into the AI, the better the voice clone will be. The number of samples is irrelevant; the total runtime is what matters. However, if you plan to upload multiple hours of audio, it is better to split it into multiple ~30-minute samples. This makes it easier to upload.
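The runtime guidance above comes down to simple arithmetic, sketched here under a few assumptions: durations are given in seconds, the function names are hypothetical, and chunks are cut at fixed 30-minute boundaries purely for illustration.

```python
MIN_SECONDS = 30 * 60              # bare minimum from this guide
RECOMMENDED_SECONDS = 3 * 60 * 60  # recommended total from this guide
CHUNK_SECONDS = 30 * 60            # approximate upload chunk size

def summarize_runtime(durations):
    """Total the sample durations and compare against the guide's targets."""
    total = sum(durations)
    return {
        "total_seconds": total,
        "meets_minimum": total >= MIN_SECONDS,
        "meets_recommended": total >= RECOMMENDED_SECONDS,
    }

def split_points(total_seconds, chunk_seconds=CHUNK_SECONDS):
    """Return (start, end) pairs covering a recording in chunks
    of at most chunk_seconds each."""
    chunks = []
    start = 0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        chunks.append((start, end))
        start = end
    return chunks
```

For example, split_points(4000) returns [(0, 1800), (1800, 3600), (3600, 4000)]. In practice you would nudge each cut to the nearest pause so that no word is split across two files.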

Uploading: Once you have submitted your files, the clone is locked in and cannot be changed. Before submitting, double-check that you have uploaded the correct samples.

Verify Your Voice: Once everything is recorded and uploaded, you will be asked to verify your voice. To ensure a smooth experience, please verify your voice using the same or similar equipment used to record the samples, and in a tone and delivery similar to that of the samples.

If you cannot access the same equipment, try verifying as best you can. If it fails, you will have to reach out to support. Please note that all of this depends on your desired output. The AI will try to clone everything in the audio, but we suggest following the guidelines above so that the AI can work optimally and predictably.

Once you’ve uploaded your samples, we will contact you as soon as the processing is complete. We will add the PVC within twenty-four hours of submitting your files.