Issue 01 / Field notes for practical AI
AI Tutorials Hub
audio

ElevenLabs voice cloning: the settings that actually sound like you

A practical guide to ElevenLabs voice cloning: how to record source audio, the stability, similarity, and style settings that matter, and the gotchas that wreck most clones.

Read time: 8 min read
Difficulty: Intermediate
Author: The AI Tutorials Hub editors

Voice cloning sounds easy — upload a sample, get a synthetic voice back. The reality is that most cloned voices sound robotic, breathy, or just slightly "off" in a way that is hard to pinpoint. The cause is almost always the source audio and the settings, not the model. This guide covers what to record, what settings to use, and how to clean a sample in Audacity when the recording is not ideal.

What you'll learn

  • How to record source audio that produces a clean clone
  • The three settings that matter (stability, similarity, style exaggeration) and how to set them
  • How to clean a sample with Audacity
  • The gotchas — accent, emotion, length limits
  • A checklist for "why does my clone sound off?"

Why most clones sound robotic

The model is the same for everyone. The variable is the source audio. Three things cause the "AI robotic" sound:

  1. Background noise — HVAC hum, room echo, computer fan, keyboard clicks.
  2. Inconsistent mic distance — the speaker moves toward and away from the mic, changing volume.
  3. Wrong emotional register — the sample is read in a flat, "presenting" tone, but the clone is asked to sound conversational.

Fix all three and the output changes dramatically.

How to record source audio that produces a clean clone

The recording environment

The room is the single most important variable. You do not need a studio, but you do need:

  • A quiet room. Close the door. Turn off HVAC for 5 minutes. Silence your phone.
  • Soft surfaces. A bedroom with a closed closet full of clothes is fine. A kitchen with hard countertops is not.
  • No reverb. Clap once. If you hear a long tail, the room is too reflective. Move to a softer room.
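
A rough way to put a number on "quiet room" is to record a few seconds that include a pause and measure the quietest stretch. The sketch below is a minimal example, assuming a 16-bit mono WAV file; the function names and the half-second window are illustrative choices, not anything ElevenLabs prescribes.

```python
import math
import sys
import wave
from array import array


def window_levels_dbfs(samples, sample_rate, window_s=0.5):
    """Return the RMS level in dBFS of each consecutive window of 16-bit samples."""
    size = max(1, int(sample_rate * window_s))
    levels = []
    for start in range(0, len(samples) - size + 1, size):
        chunk = samples[start:start + size]
        rms = math.sqrt(sum(s * s for s in chunk) / len(chunk))
        # 32768 is full scale for signed 16-bit audio; use -120 dB for digital silence.
        levels.append(20 * math.log10(rms / 32768) if rms > 0 else -120.0)
    return levels


def noise_floor_dbfs(samples, sample_rate):
    """Estimate the noise floor as the level of the quietest window."""
    levels = window_levels_dbfs(samples, sample_rate)
    if not levels:
        raise ValueError("recording is shorter than one analysis window")
    return min(levels)


def read_mono_16bit(path):
    """Load a 16-bit mono PCM WAV file as (list of int samples, sample rate)."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
            raise ValueError("expected 16-bit mono audio")
        data = array("h", wav.readframes(wav.getnframes()))
        if sys.byteorder == "big":
            data.byteswap()  # WAV data is little-endian
        return list(data), wav.getframerate()
```

Calling `noise_floor_dbfs(*read_mono_16bit("take1.wav"))` (the file name is a placeholder) gives one figure to compare between rooms. Treat any threshold as a personal rule of thumb: what matters is that the number drops when you close the door or switch off the fan.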

The microphone

A USB condenser mic in the $100 range (for example the Audio-Technica AT2020 in its USB version, or the cheaper Fifine K669) is plenty. A headset mic is usually too noisy. A laptop mic is unusable for cloning.

The mic should be 6-10 inches (15-25 cm) from your mouth, slightly off-axis (not directly in front) to avoid plosives ("p" and "b" pops).

The script

ElevenLabs recommends 1-3 minutes of clean audio for an Instant Voice Clone, and 30+ minutes for a Professional Voice Clone.

The script should be:

  • Varied in sentence structure. Not all "the quick brown fox" — include questions, exclamations, lists.
  • Varied in emotion. Some sentences neutral, some excited, some thoughtful.
  • Your natural speaking pace. Do not slow down to "sound professional" — the clone will pick up the slower pace.
  • Free of filler words ("um," "uh," "like"). Fillers in the source will appear in the clone.

A good test script:

The quick brown fox jumps over the lazy dog. Pack my box with five
dozen liquor jugs. How vexingly quick daft zebras jump! Bright vixens
jump; dozy fowl quack. The five boxing wizards jump quickly.

I started my own company three years ago because I wanted to solve a
problem I had personally. The first year was hard. The second year
was harder. The third year, things started to work.

What keeps me up at night? Honestly, it's the gap between what I
know we can build and what we have time to build this quarter.

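Read it at your natural pace and time it. At roughly 95 words, the script runs about 40 seconds at a conversational pace, so extend it with your own material to reach the length you need.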
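
To sanity-check a script's length before recording, you can estimate read time from its word count. This is a minimal sketch; the 150 words-per-minute default is an assumed conversational pace, so substitute your own measured rate.

```python
def estimate_read_seconds(text, words_per_minute=150):
    """Estimate how many seconds `text` takes to read aloud at the given pace."""
    word_count = len(text.split())
    return word_count * 60 / words_per_minute


def words_needed(target_seconds, words_per_minute=150):
    """Return roughly how many words fill `target_seconds` of speech."""
    return round(target_seconds * words_per_minute / 60)
```

For example, `words_needed(180)` suggests a script of about 450 words for a three-minute recording at the assumed pace.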

Tip
Record 3-5 minutes of raw audio, not 1, so you can cut weak takes. For Instant Voice Clones, about 3 minutes of clean audio is the sweet spot. For Professional Voice Clones, aim for 30+ minutes.

The recording software

Audacity (free) or any DAW. Record at 44.1 kHz, 16-bit, mono. Save as WAV. ElevenLabs accepts WAV, MP3, and M4A — WAV is the safest.
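
Before uploading, it can help to confirm that the exported file really has the format described above. The following sketch uses Python's standard `wave` module; the function name and the returned structure are illustrative.

```python
import wave


def check_wav_format(path, rate=44100, sample_width_bytes=2, channels=1):
    """Compare a PCM WAV file against the expected format.

    Returns (problems, duration_seconds); an empty problems list means a match.
    Note: wave.open raises wave.Error for non-PCM files such as 32-bit float WAV.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != rate:
            problems.append(f"sample rate is {wav.getframerate()} Hz, expected {rate}")
        if wav.getsampwidth() != sample_width_bytes:
            bits = wav.getsampwidth() * 8
            problems.append(f"bit depth is {bits}-bit, expected {sample_width_bytes * 8}-bit")
        if wav.getnchannels() != channels:
            problems.append(f"{wav.getnchannels()} channels, expected {channels}")
        duration = wav.getnframes() / wav.getframerate()
    return problems, duration
```

The duration figure is a convenient side effect: it tells you whether the take actually reached the length you were aiming for.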

The three settings that matter

Once you have uploaded the sample, ElevenLabs exposes three sliders. Most users ignore them, which is a mistake.

Stability (0-100)

How consistent the delivery is from one generation to the next, and therefore how much expressive variation you get. Low stability = more emotional variation, more dynamic range, more risk of weird artifacts. High stability = more monotone, more predictable, more "robotic."

For a natural-sounding clone:

  • Start at 50. If the output sounds flat, lower to 35-40.
  • If the output has weird artifacts (sudden pitch shifts, broken words), raise to 60-70.

Similarity (0-100)

How closely the clone matches the source voice. High similarity = more like the source, more risk of artifacts if the source had any. Low similarity = more "generic," less like you.

For a natural-sounding clone:

  • Start at 75. If the output does not sound like you, raise to 85-90.
  • If the output has artifacts (weird timbre, broken consonants), lower to 60-65.

Style exaggeration (0-100)

Only on some models. How much the clone amplifies the emotional style of the source.

For a natural-sounding clone:

  • Start at 0. Increase only if the output is too flat and you have already tried lowering stability.

Tip
Leave Speaker Boost on by default. It pushes the output closer to the source speaker without changing the voice's character, at the cost of slightly slower generation.
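
If you drive ElevenLabs from code rather than the web UI, the same starting values can be expressed as a request body. This is a sketch, not official client code: it assumes the public REST text-to-speech endpoint, where the sliders are sent as 0.0-1.0 values under `voice_settings` with the field names `stability`, `similarity_boost`, `style` and `use_speaker_boost`. Check the current API reference before relying on it; the helper names here are hypothetical.

```python
import json


def slider_to_unit(value):
    """Convert a 0-100 slider position to the 0.0-1.0 range, clamping bad input."""
    return max(0, min(100, value)) / 100


def build_tts_payload(text, stability=50, similarity=75, style=0,
                      speaker_boost=True, model_id="eleven_multilingual_v2"):
    """Build a JSON-serializable body using this guide's starting values."""
    return {
        "text": text,
        "model_id": model_id,  # assumed model identifier; pick the one you use
        "voice_settings": {
            "stability": slider_to_unit(stability),
            "similarity_boost": slider_to_unit(similarity),
            "style": slider_to_unit(style),
            "use_speaker_boost": speaker_boost,
        },
    }


if __name__ == "__main__":
    # Flat-sounding output: lower stability first, leave style at 0.
    print(json.dumps(build_tts_payload("Testing my clone.", stability=40), indent=2))
```

Keeping the settings in one function makes it easy to change a single slider per test and compare the results side by side.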

How to clean a sample with Audacity

If your recording is not ideal (background noise, plosives, room echo), clean it before uploading to ElevenLabs. Five Audacity steps take 5 minutes:

  1. Noise reduction. Select 2-3 seconds of "silence" at the start of the recording. Effect → Noise Reduction → Get Noise Profile. Then select the entire track. Effect → Noise Reduction → OK (default settings).
  2. EQ — cut low rumble. Effect → Filter Curve EQ → Preset → Low Roll-off for Speech. This cuts frequencies below 80 Hz that contribute to "muddiness."
  3. Compressor. Effect → Compressor. Threshold -15 dB, Ratio 3:1, Attack 0.1s, Release 0.5s. This evens out the volume.
  4. Normalize. Effect → Normalize → -3 dB. Brings the peak to a safe level.
  5. Export as WAV. File → Export → Export as WAV → 16-bit PCM.

Tip
If the recording has plosives ("p" and "b" pops), try Effect → Click Removal (or the De-Clicker plug-in, if you have installed it), or manually trim the loudest 0.1 seconds of each pop with the selection tool. Plosives in the source become amplified plosives in the clone.
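
To make the Normalize step less of a black box, here is what peak normalization to -3 dB amounts to. This is a simplified sketch, not Audacity's implementation (Audacity can also remove DC offset in the same step); it assumes 16-bit samples already loaded as a list of integers.

```python
def normalize_peak(samples, target_dbfs=-3.0):
    """Scale 16-bit samples so the loudest peak lands at `target_dbfs`."""
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return list(samples)  # silent input: nothing to scale
    # Convert the dB target to a linear amplitude relative to 16-bit full scale.
    target_peak = 32767 * 10 ** (target_dbfs / 20)
    gain = target_peak / peak
    return [int(round(s * gain)) for s in samples]
```

Every sample is multiplied by the same gain, so the relative dynamics are untouched. That is why normalization comes after compression in the steps above: the compressor changes dynamics, then normalization sets the final level.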

Gotchas

1. Accent and language mismatch

If your source is in English with a non-American accent, the clone should retain that accent. In practice, the output sometimes drifts toward an American accent, which sounds worse than the source. For a more faithful clone, try the Multilingual v2 model and, where the model offers a language setting, set it to match the source.

2. Emotion in the source carries over

If your source is read in a sad tone, the clone will sound sad. If your source is read in an "announcer" tone, the clone will sound like a TV announcer. The source's emotional register sets the baseline for everything the clone produces.

3. Length limits

ElevenLabs has per-generation length limits (typically 5000 characters per call). For longer outputs, you must stitch multiple generations. The stitching is where the "AI voice" feeling creeps in — make sure to break at natural sentence boundaries, not mid-sentence.
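
One way to prepare long text for stitching is to split it into chunks that stay under the limit and only break between sentences. The sketch below uses a simple punctuation-based splitter, which is an assumption that works for plain prose but will mis-split abbreviations such as "e.g."; the 5000-character default mirrors the typical limit mentioned above.

```python
import re


def split_for_tts(text, max_chars=5000):
    """Split text into chunks of at most `max_chars`, breaking only between sentences."""
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if len(sentence) > max_chars:
            raise ValueError("a single sentence exceeds max_chars")
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be generated separately and the audio files joined in order. Breaking at paragraph boundaries where possible tends to hide the joins even better, since a pause is expected there.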

4. The "uncanny valley" voice

The most common feedback I get on cloned voices is "it sounds 95% like you, but there's something off." This is usually a single artifact: a slightly too-long pause, a too-perfect pronunciation of a colloquial word, or an absence of breath. The fix is to leave a few audible breaths in the source recording; the model tends to reproduce them as natural pauses.

5. The "ownership" question

If you clone someone else's voice without their consent, using the output can be unlawful in many jurisdictions (for example, under right-of-publicity laws). ElevenLabs' terms require that you hold the rights to any voice you clone. The exception is the pre-made "Voice Library" voices, which are licensed for use.

A quick checklist for "why does my clone sound off?"

Walk through these in order:

  1. Did you record in a quiet room with a real mic? (No → re-record.)
  2. Did you record 3+ minutes of varied speech? (No → re-record more.)
  3. Did you clean the audio in Audacity? (No → run the 5-step cleanup.)
  4. Is Stability around 50, Similarity around 75? (No → adjust.)
  5. Is the source emotional register the one you want? (No → re-record with a different tone.)
  6. Did you add a few audible breaths to the source? (No → re-record with breaths.)

If you have done all six and the output is still off, the model version or the language setting may be wrong. Try a different ElevenLabs model (Multilingual v2 vs Turbo v2) and see if that helps.

FAQ

Do I own my cloned voice?

For voices you clone from your own recordings, yes. For voices you clone from someone else without their consent, no. For pre-made Voice Library voices, the licensing terms apply (commercial use is usually included).

Can I clone someone else's voice?

Technically yes, in the sense that the tool may accept the upload. Legally and ethically, you need their explicit consent. Some US states have right-of-publicity laws that make non-consensual cloning illegal.

Why does my clone sound off?

Walk through the 6-step checklist. Most of the time, the answer is in steps 1-3 (recording setup, source length and variety, cleanup).

How long can a generated voice be?

Per generation, around 5000 characters, which is roughly 5-6 minutes of speech at a conversational pace. For longer audio, stitch multiple generations at natural sentence boundaries.

Can I use the clone for a podcast or YouTube video?

Yes, for voices you own the rights to. Add a disclosure ("This audio uses a synthetic voice clone of [your name]") to be safe — some platforms require it.

What's the difference between Instant Clone and Professional Clone?

Instant Clone uses 1-3 minutes of audio and produces a "good enough" clone in seconds. Professional Clone uses 30+ minutes of audio and produces a higher-fidelity clone after manual review by ElevenLabs (24-48 hours). For most users, Instant Clone is enough. For commercial audiobook or film work, Professional Clone is worth it.

Does the model work for languages other than English?

Yes. The Multilingual v2 model supports 29 languages. The source audio can be in any of them; the output language follows the language of the text-to-speech input, not the source.
