Verbatik

Voice Cloning

Clone a voice from an audio sample and generate speech with full control over speed, pitch, emotion, and more.

Overview

Voice cloning is a two-step process:

  1. Clone a voice — Upload an audio sample to create a custom voice ID.
  2. Generate speech — Use the cloned voice ID to synthesize speech from text.

Step 1: Clone a Voice

Endpoint

POST /api/v1/voice-training

Audio Requirements

Requirement         Value
Minimum duration    10 seconds
Maximum file size   20 MB
Supported formats   .mp3, .wav
Recommended         Clear speech, minimal background noise
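A local pre-flight check can catch samples that violate these requirements before you upload them. The Python sketch below uses a hypothetical helper that is not part of any Verbatik SDK. It checks the extension and file size for both formats but measures duration only for .wav files. The standard `wave` module can read uncompressed PCM WAV headers, while decoding MP3 would need a third-party library.

```python
import os
import wave

# Limits taken from the requirements table above.
MIN_DURATION_S = 10.0
# The docs do not say whether "20 MB" means 10**6 or 2**20 bytes; use the smaller, stricter value.
MAX_BYTES = 20 * 1000 * 1000
ALLOWED_EXTENSIONS = {".mp3", ".wav"}


def check_sample(path: str) -> list:
    """Return a list of problems with a local voice sample; an empty list means every check passed."""
    problems = []
    ext = os.path.splitext(path)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported format {ext!r}: use .mp3 or .wav")
    size = os.path.getsize(path)
    if size > MAX_BYTES:
        problems.append(f"file is {size} bytes: the limit is 20 MB")
    if ext == ".wav":
        # wave reads uncompressed PCM WAV headers only; MP3 duration is not checked here.
        with wave.open(path, "rb") as wav:
            duration = wav.getnframes() / wav.getframerate()
        if duration < MIN_DURATION_S:
            problems.append(f"sample is {duration:.1f} s long: at least 10 s is required")
    return problems
```

For example, `check_sample("my-voice-sample.wav")` returns an empty list when the file passes every check, and one message per failed check otherwise.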

Request

Headers:

Header          Required   Description
Authorization   Yes        Bearer YOUR_API_KEY
Content-Type    Yes        application/json

Body:

{
  "audio_url": "https://example.com/my-voice-sample.mp3",
  "name": "My Custom Voice",
  "noise_reduction": true,
  "volume_normalization": true,
  "accuracy": 0.8,
  "preview_text": "Hello, this is a preview of my cloned voice."
}

Field                  Required   Type      Description
audio_url              Yes        string    URL of the audio file (must be publicly accessible).
name                   No         string    Name for the voice. Auto-generated if not provided.
noise_reduction        No         boolean   Enable noise reduction. Default: false.
volume_normalization   No         boolean   Normalize volume levels. Default: false.
accuracy               No         number    Text validation accuracy (0–1). Default: 0.8.
preview_text           No         string    Custom text for the preview audio clip.
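The fields above map directly onto a JSON body. This sketch uses a hypothetical helper, `build_clone_body`. It fills in the documented defaults, omits `name` and `preview_text` when they are unset so the service applies its own behavior, and rejects an out-of-range `accuracy`. The URL check only looks for an http(s) scheme and cannot confirm that the file is publicly accessible.

```python
import json
from typing import Optional


def build_clone_body(
    audio_url: str,
    name: Optional[str] = None,
    noise_reduction: bool = False,
    volume_normalization: bool = False,
    accuracy: float = 0.8,
    preview_text: Optional[str] = None,
) -> str:
    """Serialize a POST /api/v1/voice-training body, omitting unset optional fields."""
    # Scheme check only; public accessibility cannot be verified client-side.
    if not audio_url.startswith(("http://", "https://")):
        raise ValueError("audio_url must be an http(s) URL")
    if not 0 <= accuracy <= 1:
        raise ValueError("accuracy must be between 0 and 1")
    body = {
        "audio_url": audio_url,
        "noise_reduction": noise_reduction,
        "volume_normalization": volume_normalization,
        "accuracy": accuracy,
    }
    # Unset fields are left out so server-side defaults apply (e.g. an auto-generated name).
    if name is not None:
        body["name"] = name
    if preview_text is not None:
        body["preview_text"] = preview_text
    return json.dumps(body)
```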

Response

{
  "success": true,
  "voice_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "name": "My Custom Voice",
  "preview_url": "https://storage.verbatik.com/audio/preview-abc123.mp3",
  "cost_cents": 300,
  "balance_cents": 1700
}

Cost: $3.00 per voice.
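The response reports money in cents. The following minimal sketch extracts the voice ID and converts `cost_cents` and `balance_cents` to dollars, using `Decimal` to avoid floating-point rounding. The error-response format is not documented here, so the sketch assumes only that a failed request does not return `success: true`.

```python
import json
from decimal import Decimal


def summarize_clone_response(raw: str) -> dict:
    """Extract the voice ID and convert cent amounts to dollars."""
    data = json.loads(raw)
    # Assumption: anything other than "success": true is treated as a failure.
    if data.get("success") is not True:
        raise RuntimeError(f"voice cloning failed: {data}")
    return {
        "voice_id": data["voice_id"],
        "preview_url": data.get("preview_url"),
        "cost_usd": Decimal(data["cost_cents"]) / 100,
        "balance_usd": Decimal(data["balance_cents"]) / 100,
    }
```

Keep the returned `voice_id`: it is the value for the `X-Voice-ID` header in Step 2.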

Step 2: Generate Speech with a Cloned Voice

Endpoint

POST /api/v1/voice-cloning

Headers

Header          Required   Description
Authorization   Yes        Bearer YOUR_API_KEY
Content-Type    Yes        text/plain
X-Voice-ID      Yes        The voice ID returned by the cloning step.
X-Store-Audio   No         Set to true to store the audio and receive a URL. Default: false.

Voice Settings (optional headers)

Header                    Range        Default   Description
X-Speed                   0.5–2.0      1         Speech speed multiplier.
X-Volume                  0–10         1         Volume level.
X-Pitch                   -12 to 12    0         Pitch adjustment in semitones.
X-Emotion                 See below              Emotion of the generated speech.
X-English-Normalization   true/false             Improve number and abbreviation reading.
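The ranges in this table can be checked on the client before a request is sent. In the sketch below, `voice_setting_headers` is a hypothetical helper. The emotion names come from the Supported Emotions list in this section, and treating `X-Pitch` as whole semitones is an assumption. Settings left as `None` are omitted so the server defaults apply.

```python
from typing import Optional

# Emotion names from the Supported Emotions list; ranges from the Voice Settings table.
EMOTIONS = {"happy", "sad", "angry", "fearful", "disgusted", "surprised", "neutral"}


def voice_setting_headers(
    speed: Optional[float] = None,
    volume: Optional[float] = None,
    pitch: Optional[int] = None,
    emotion: Optional[str] = None,
    english_normalization: Optional[bool] = None,
) -> dict:
    """Build the optional voice-setting headers; arguments left as None are omitted."""
    headers = {}
    if speed is not None:
        if not 0.5 <= speed <= 2.0:
            raise ValueError("X-Speed must be between 0.5 and 2.0")
        headers["X-Speed"] = str(speed)
    if volume is not None:
        if not 0 <= volume <= 10:
            raise ValueError("X-Volume must be between 0 and 10")
        headers["X-Volume"] = str(volume)
    if pitch is not None:
        # Assumption: pitch is sent as whole semitones.
        if not isinstance(pitch, int) or not -12 <= pitch <= 12:
            raise ValueError("X-Pitch must be an integer from -12 to 12")
        headers["X-Pitch"] = str(pitch)
    if emotion is not None:
        if emotion not in EMOTIONS:
            raise ValueError(f"unsupported emotion: {emotion!r}")
        headers["X-Emotion"] = emotion
    if english_normalization is not None:
        headers["X-English-Normalization"] = "true" if english_normalization else "false"
    return headers
```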

Voice Modification (advanced, optional)

HeaderRangeDescription
X-Voice-Modify-Pitch-100 to 100Fine-grained pitch adjustment.
X-Voice-Modify-Intensity-100 to 100Energy/intensity of the voice.
X-Voice-Modify-Timbre-100 to 100Tonal quality adjustment.
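The advanced modification headers share one range, so a single loop can validate them. The following minimal sketch uses a hypothetical helper name. Values left as `None` are not sent, because the table lists no defaults for these headers.

```python
from typing import Optional


def voice_modify_headers(
    pitch: Optional[int] = None,
    intensity: Optional[int] = None,
    timbre: Optional[int] = None,
) -> dict:
    """Build the advanced voice-modification headers; None values are omitted."""
    values = {
        "X-Voice-Modify-Pitch": pitch,
        "X-Voice-Modify-Intensity": intensity,
        "X-Voice-Modify-Timbre": timbre,
    }
    headers = {}
    for name, value in values.items():
        if value is None:
            continue  # not set: let the server use its own behavior
        if not -100 <= value <= 100:
            raise ValueError(f"{name} must be between -100 and 100")
        headers[name] = str(value)
    return headers
```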

Audio Settings (optional)

Header             Values                                    Default   Description
X-Sample-Rate      8000, 16000, 22050, 24000, 32000, 44100   32000     Sample rate in Hz.
X-Bitrate          32000, 64000, 128000, 256000              128000    Audio bitrate in bits per second.
X-Format           mp3, pcm, flac                            mp3       Output format.
X-Language-Boost   Language code                                       Enhance language recognition.
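The sample rate, bitrate, and format accept only fixed values, so a set-membership check is enough. This sketch uses a hypothetical helper. `X-Language-Boost` is passed through unchecked because the accepted language codes are not listed in this section.

```python
from typing import Optional

# Allowed values from the Audio Settings table.
SAMPLE_RATES = {8000, 16000, 22050, 24000, 32000, 44100}
BITRATES = {32000, 64000, 128000, 256000}
FORMATS = {"mp3", "pcm", "flac"}


def audio_setting_headers(
    sample_rate: Optional[int] = None,
    bitrate: Optional[int] = None,
    fmt: Optional[str] = None,
    language_boost: Optional[str] = None,
) -> dict:
    """Build the optional audio-setting headers; None values are omitted."""
    headers = {}
    if sample_rate is not None:
        if sample_rate not in SAMPLE_RATES:
            raise ValueError(f"unsupported X-Sample-Rate: {sample_rate}")
        headers["X-Sample-Rate"] = str(sample_rate)
    if bitrate is not None:
        if bitrate not in BITRATES:
            raise ValueError(f"unsupported X-Bitrate: {bitrate}")
        headers["X-Bitrate"] = str(bitrate)
    if fmt is not None:
        if fmt not in FORMATS:
            raise ValueError(f"unsupported X-Format: {fmt!r}")
        headers["X-Format"] = fmt
    if language_boost is not None:
        # Accepted language codes are not listed here, so the value is not validated.
        headers["X-Language-Boost"] = language_boost
    return headers
```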

Body: Plain text to synthesize (max 5,000 characters).
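Text longer than 5,000 characters has to be sent in several requests. The sketch below splits at spaces where it can and falls back to a hard cut otherwise. Keep two assumptions in mind. First, Python's `len` counts Unicode code points, which may not match how the service counts characters. Second, a split can land inside a multi-word interjection tag such as (clears throat), so check chunk boundaries if you use tags.

```python
MAX_CHARS = 5000  # per-request limit stated above


def split_text(text: str, limit: int = MAX_CHARS) -> list:
    """Split text into chunks of at most `limit` characters, breaking at spaces when possible."""
    chunks = []
    remaining = text.strip()
    while len(remaining) > limit:
        # Last space at or before the limit, so the chunk stays within `limit` characters.
        cut = remaining.rfind(" ", 0, limit + 1)
        if cut <= 0:
            cut = limit  # no usable space: hard split mid-word
        chunks.append(remaining[:cut].rstrip())
        remaining = remaining[cut:].lstrip()
    if remaining:
        chunks.append(remaining)
    return chunks
```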

Supported Emotions

happy, sad, angry, fearful, disgusted, surprised, neutral

Special Text Features

Interjection tags — Insert natural speech sounds: (laughs), (sighs), (coughs), (clears throat), (gasps), (sniffs), (groans), (yawns)

Pause markers — Insert precise pauses: <#x#> where x is the duration in seconds (0.01–99.99). Example: Hello <#1.5#> world
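A helper for building markers keeps the duration inside the documented 0.01–99.99 second range. The `pause` function below is hypothetical. It formats to two decimal places and trims trailing zeros, on the assumption that the service accepts any decimal value in range (the documented example uses 1.5).

```python
def pause(seconds: float) -> str:
    """Return a <#x#> pause marker for a duration between 0.01 and 99.99 seconds."""
    if not 0.01 <= seconds <= 99.99:
        raise ValueError("pause must be between 0.01 and 99.99 seconds")
    # Two decimal places, as in the documented range; "1.50" becomes "1.5", "2.00" becomes "2".
    text = f"{seconds:.2f}".rstrip("0").rstrip(".")
    return f"<#{text}#>"
```

For example, `f"Hello {pause(1.5)} world"` produces the documented `Hello <#1.5#> world`.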

Pricing

$0.08 per 1,000 characters ($80 per 1M characters).
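As a worked example, a request at the 5,000-character maximum costs 5,000 × $0.08 / 1,000 = $0.40. The sketch below applies the same arithmetic with `Decimal`. It assumes pro-rata billing per character. This section does not state any minimum charge, rounding rule, or whether tags and pause markers count toward the total.

```python
from decimal import Decimal

PRICE_PER_1000_CHARS = Decimal("0.08")  # USD, from the pricing line above


def estimate_cost(text: str) -> Decimal:
    """Estimate the USD cost of synthesizing `text`, assuming pro-rata per-character billing."""
    return PRICE_PER_1000_CHARS * len(text) / 1000
```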

Example

curl -X POST https://api.verbatik.com/api/v1/voice-cloning \
  -H "Authorization: Bearer vbt_your_api_key" \
  -H "Content-Type: text/plain" \
  -H "X-Voice-ID: a1b2c3d4-e5f6-7890-abcd-ef1234567890" \
  -H "X-Speed: 1.1" \
  -H "X-Emotion: happy" \
  -H "X-Format: mp3" \
  --data "Hello! This is my cloned voice speaking with a happy emotion." \
  --output cloned-speech.mp3
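The same request can be made with Python's standard library. In this sketch, `build_tts_request` and `synthesize` are hypothetical helper names. Encoding the text as UTF-8 is an assumption, since the docs specify only `text/plain`. With the default `X-Store-Audio: false`, the curl example writes the response body straight to an audio file, and `synthesize` does the same.

```python
import urllib.request
from typing import Optional

BASE_URL = "https://api.verbatik.com"  # host from the curl example
MAX_CHARS = 5000


def build_tts_request(
    api_key: str,
    voice_id: str,
    text: str,
    extra_headers: Optional[dict] = None,
) -> urllib.request.Request:
    """Build the POST /api/v1/voice-cloning request without sending it."""
    if len(text) > MAX_CHARS:
        raise ValueError("text exceeds the 5,000-character limit")
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "text/plain",
        "X-Voice-ID": voice_id,
    }
    headers.update(extra_headers or {})  # e.g. X-Speed, X-Emotion, X-Format
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/voice-cloning",
        data=text.encode("utf-8"),  # assumption: the service expects UTF-8
        headers=headers,
        method="POST",
    )


def synthesize(req: urllib.request.Request, out_path: str) -> None:
    """Send the request and save the returned audio bytes to out_path."""
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as out:
        out.write(resp.read())
```

To reproduce the curl call, pass `{"X-Speed": "1.1", "X-Emotion": "happy", "X-Format": "mp3"}` as `extra_headers`, then call `synthesize(req, "cloned-speech.mp3")`.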

Managing Cloned Voices

List Your Cloned Voices

GET /api/v1/my-voices

Query parameters:

Parameter   Type     Description
status      string   Filter by status: pending, ready, or failed.

Response:

[
  {
    "id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    "name": "My Custom Voice",
    "status": "ready",
    "preview_url": "https://storage.verbatik.com/audio/preview-abc123.mp3",
    "source_audio_url": "https://example.com/my-voice-sample.mp3",
    "created_at": "2025-01-15T10:30:00.000Z",
    "last_used_at": "2025-01-20T14:00:00.000Z"
  }
]
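The following sketch calls this endpoint with Python's standard library. It assumes the endpoint uses the same Bearer authentication as the other endpoints and takes `status` as a query-string parameter. Both helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

BASE_URL = "https://api.verbatik.com"  # host from the curl example
STATUSES = {"pending", "ready", "failed"}


def list_voices_request(api_key: str, status: Optional[str] = None) -> urllib.request.Request:
    """Build GET /api/v1/my-voices, optionally filtered by status."""
    url = f"{BASE_URL}/api/v1/my-voices"
    if status is not None:
        if status not in STATUSES:
            raise ValueError(f"unknown status: {status!r}")
        # Assumption: the filter is sent as a query-string parameter.
        url += "?" + urllib.parse.urlencode({"status": status})
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})


def ready_voice_ids(raw: str) -> list:
    """Return the IDs of voices whose status is 'ready' in a list response."""
    return [voice["id"] for voice in json.loads(raw) if voice.get("status") == "ready"]
```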

Voice Statuses

Status    Description
pending   Voice is being processed (typically a few seconds).
ready     Voice is ready for TTS generation.
failed    Cloning failed. Check the error message.
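Because `pending` usually lasts only a few seconds, a short polling loop is enough to wait for a final status. This section does not show a single-voice status endpoint, so `get_status` is a callable you supply. For example, it could request GET /api/v1/my-voices and return the `status` of the entry whose `id` matches. The `sleep` parameter lets the loop be tested without real delays.

```python
import time
from typing import Callable


def wait_until_ready(
    get_status: Callable[[], str],
    timeout_s: float = 60.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status() until the voice leaves 'pending'; return the final status."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status != "pending":
            return status  # "ready" or "failed"
        if time.monotonic() >= deadline:
            raise TimeoutError("voice is still pending")
        sleep(interval_s)
```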

Voice Expiration

Cloned voices expire after 7 days of inactivity. Verbatik automatically sends keep-alive requests for actively used voices. If a voice expires, you'll need to clone it again.
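You can estimate an expiry time from the list response to spot voices at risk of expiring. This sketch assumes the 7-day clock runs from `last_used_at` and falls back to `created_at` when a voice has never been used. Verbatik also sends keep-alive requests for active voices, so treat the result as an estimate, not a guarantee.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

INACTIVITY_LIMIT = timedelta(days=7)


def parse_timestamp(value: str) -> datetime:
    # fromisoformat() accepts a trailing "Z" only from Python 3.11, so normalize it first.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def expires_at(voice: dict) -> datetime:
    """Estimate expiry, assuming the 7-day clock starts at last_used_at (else created_at)."""
    last = voice.get("last_used_at") or voice["created_at"]
    return parse_timestamp(last) + INACTIVITY_LIMIT


def is_expiring_soon(voice: dict, now: Optional[datetime] = None,
                     margin: timedelta = timedelta(days=1)) -> bool:
    """True if the estimated expiry is within `margin` of `now` (or already past)."""
    if now is None:
        now = datetime.now(timezone.utc)
    return expires_at(voice) - now <= margin
```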

Uploading Audio

If your audio isn't hosted at a public URL, upload it first:

POST /api/audio-upload

Send the file as multipart form data. The response includes a URL that you can pass as the audio_url field when cloning a voice.
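A multipart upload can be assembled with the standard library alone. This section does not specify the form-field name `file`, the `api.verbatik.com` host for this path, or Bearer authentication, so all three are assumptions. The name of the response field that holds the URL is also undocumented, so the sketch stops at building the request.

```python
import urllib.request
import uuid

# Assumption: the upload path lives on the same host as the /api/v1 endpoints.
BASE_URL = "https://api.verbatik.com"


def build_upload_request(
    api_key: str,
    filename: str,
    data: bytes,
    field: str = "file",  # hypothetical form-field name
    content_type: str = "audio/mpeg",
) -> urllib.request.Request:
    """Wrap one file in a multipart/form-data body for POST /api/audio-upload."""
    boundary = uuid.uuid4().hex
    # The filename is not escaped, so keep it to plain ASCII without quotes.
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n"
        "\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/api/audio-upload",
        data=head + data + tail,
        headers={
            "Authorization": f"Bearer {api_key}",  # assumption: same auth as /api/v1
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
        method="POST",
    )
```

Send it with `urllib.request.urlopen(req)` and read the URL from the JSON response.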

Tips for Best Results

  1. Use clear, high-quality audio — Record in a quiet environment.
  2. Speak naturally — Natural speech patterns produce better clones.
  3. Provide at least 10 seconds — Longer samples (30–60 seconds) generally produce better results.
  4. Enable noise reduction — If your audio has background noise.
  5. Use volume normalization — Helps with inconsistent audio levels.