What is PVC - Professional Voice Cloning?

Professional Voice Cloning (PVC) allows you to train a hyper-realistic model of a voice. This is unlike Instant Voice Cloning (IVC), which lets you clone voices from very short samples nearly instantaneously.

This is achieved by training a dedicated model on a large set of voice data to produce a model that’s indistinguishable from the original
voice.

Since the custom models require fine-tuning and training, it will take a bit longer to train these Professional Voice Clones compared to the Instant Voice Clones. Giving an estimate is challenging as it depends on the number of people in the queue before you and a few other factors.

Here are the current estimates for Professional Voice Cloning:

English: ~30 minutes up to 3 hours
Multilingual: ~3 hours up to 6 hours

There are a few things to be mindful of before you start uploading your samples, and some steps that you need to take to ensure the best possible results.

Firstly, Professional Voice Cloning is highly accurate in cloning the samples used for its training. It will create a near-perfect clone of
what it hears, including all the intricacies and characteristics of that voice, but also including any artefacts and unwanted audio present in the samples. This means that if you upload low-quality samples with background noise, room reverb/echo, or any other type of unwanted sounds like music or multiple people speaking, the AI will try to replicate all of these elements in the clone as well.

Secondly, make sure there’s only a single speaking voice throughout the audio, as more than one speaker, excessive noise, or any of the issues above can confuse the AI. This confusion can result in the AI being unable to discern which voice to clone or misinterpreting what the voice sounds like because it is being masked by other sounds, leading to a less-than-optimal clone.

Thirdly, make sure you have enough material to clone the voice properly. The bare minimum we recommend is 30 minutes of audio, but for the optimal result and the most accurate clone, we recommend closer to 3 hours of audio. You might be able to get away with less, but at that point, we can’t vouch for the quality of the resulting clone.

Fourthly, the speaking style in the samples you provide will be replicated in the output, so depending on what delivery you are looking
for, the training data should correspond to that style (e.g. if you are looking to voice an audiobook with a clone of your voice, the audio you submit for training should be a recording of you reading a book in the tone of voice you want to use). It is better to just include one style in the uploaded samples for consistency's sake.

Lastly, it’s best to use samples where you are speaking the language that the PVC will mainly be used for. Of course, the AI can speak any language that we currently support. However, it is worth noting that if the voice itself is not native to the language you want the AI to speak - meaning you cloned a voice speaking a different language - it might have an accent from the original language and might mispronounce words and inflexions. For instance, if you clone a voice speaking English and then want it to speak Spanish, it will very likely have an English accent when speaking Spanish. We only support cloning samples recorded in one of our supported languages, and the application will reject your sample if it is recorded in an unsupported language.

For now, all Voice Air users will be able to clone multiple voices, but only one voice will show on your account at a time. If you choose to clone another voice, you will first have to delete the current PVC on your account; we will help you with this. You will be asked to go through a verification process before submitting your fine-tuning request.

Professional Recording Equipment: Use high-quality recording equipment for optimal results, as the AI will clone everything about the audio. High-quality input = high-quality output. Any microphone will work, but an XLR mic going into a dedicated audio interface would be our recommendation. For an entry-level setup, we would generally recommend something like an Audio Technica AT2020 or a Rode NT1 going into a Focusrite interface or similar.

Use a Pop-Filter: Use a pop filter when recording to minimize plosives.

Microphone Distance: Position yourself at the right distance from the microphone - approximately two fists away from the mic is recommended, but it also depends on what type of recording you want.

Noise-Free Recording: Ensure that the audio input doesn’t have any interference, like background music or noise. The AI cloning works best with clean, uncluttered audio.

Room Acoustics: Preferably, record in an acoustically-treated room. This reduces unwanted echoes and background noises, leading to clearer audio input for the AI. You can make something temporary using a thick duvet or quilt to dampen the recording space.

Audio Pre-processing: Consider editing your audio beforehand if you’re aiming for a specific sound you want the AI to output. For instance, if you want a polished podcast-like output, pre-process your audio to match that quality. Likewise, if you have long pauses or many “uhm”s and “ahm”s between words, the AI will mimic those as well.

Volume Control: Maintain a consistent volume that’s loud enough to be clear but not so loud that it causes distortion. The goal is to achieve a balanced and steady audio level. The ideal would be between -23 dB and -18 dB RMS with a true peak of -3 dB.
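As a rough illustration of the level targets above, here is a minimal Python sketch that measures the RMS level and peak of a recording. It makes several assumptions that are not part of the guidelines: the samples are floats normalised to the range -1.0 to 1.0, the dB values are read as dB relative to full scale, and the function names (rms_db, peak_db, check_levels) are hypothetical. It also measures the plain sample peak, which only approximates a true peak; a real true-peak meter oversamples the signal first.

```python
import math


def rms_db(samples):
    """RMS level in dB relative to full scale (samples assumed in -1.0..1.0)."""
    if not samples:
        raise ValueError("no samples provided")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return float("-inf")  # digital silence
    return 10 * math.log10(mean_square)


def peak_db(samples):
    """Sample peak in dB relative to full scale (an approximation of true peak)."""
    if not samples:
        raise ValueError("no samples provided")
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return float("-inf")
    return 20 * math.log10(peak)


def check_levels(samples, rms_low=-23.0, rms_high=-18.0, peak_max=-3.0):
    """Compare a recording against the suggested RMS window and peak ceiling."""
    rms = rms_db(samples)
    peak = peak_db(samples)
    return {
        "rms_db": rms,
        "peak_db": peak,
        "rms_ok": rms_low <= rms <= rms_high,
        "peak_ok": peak <= peak_max,
    }


if __name__ == "__main__":
    # One second of a 440 Hz test tone at 48 kHz, standing in for real audio.
    tone = [0.15 * math.sin(2 * math.pi * 440 * n / 48000) for n in range(48000)]
    print(check_levels(tone))
```

In practice you would load the samples from your own recording and use the result as a sanity check before uploading, not as a replacement for a proper loudness meter.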

Sufficient Audio Length: Provide at least 30 minutes of high-quality audio that follows the above guidelines for best results - preferably closer to 3 hours of audio. The more quality data you can feed into the AI, the better the voice clone will be. The number of samples is irrelevant; the total runtime is what matters. However, if you plan to upload multiple hours of audio, it is better to split it into multiple ~30-minute samples. This makes it easier to upload.
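The arithmetic behind this guideline can be sketched in a few lines of Python. This is only an illustration: the function name plan_upload and the returned fields are hypothetical, and the 30-minute figures are taken from the recommendation above rather than from any limit enforced by the application.

```python
import math

MIN_TOTAL_MINUTES = 30  # recommended minimum total runtime
CHUNK_MINUTES = 30      # suggested approximate length of each uploaded sample


def plan_upload(total_minutes, chunk_minutes=CHUNK_MINUTES):
    """Suggest how many roughly equal samples to split a recording into."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    # Round up so that no single sample exceeds the suggested chunk length.
    chunks = max(1, math.ceil(total_minutes / chunk_minutes))
    return {
        "meets_minimum": total_minutes >= MIN_TOTAL_MINUTES,
        "chunks": chunks,
        "minutes_per_chunk": total_minutes / chunks,
    }


if __name__ == "__main__":
    for minutes in (20, 100, 180):
        print(minutes, plan_upload(minutes))
```

Because only the total runtime matters, splitting evenly is purely a convenience for uploading; any split that keeps each file manageable would serve the same purpose.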

Uploading: Once you have submitted your files, you will not be able to make any changes to the clone and it will be locked in. Ensure that you have uploaded the correct samples that you want to use.

Verify Your Voice: Once everything is recorded and uploaded, you will be asked to verify your voice. To ensure a smooth experience, please try to verify your voice using the same or similar equipment used to record the samples and in a tone and delivery that is similar to what was present in the samples.

If you do not have access to the same equipment, try verifying the best you can. If it fails, you will have to reach out to support.

Keep in mind that all of this depends on the output you want. The AI will try to clone everything in the audio, but for the AI to work optimally and predictably, we suggest following the guidelines mentioned above.

Once you’ve uploaded your samples, we will contact you as soon as the processing is complete. We will add the PVC to your account within twenty-four hours of receiving your files.