Voice Cloning
Clone a voice from an audio sample and generate speech with full control over speed, pitch, emotion, and more.
Overview
Voice cloning is a two-step process:
- Clone a voice — Upload an audio sample to create a custom voice ID.
- Generate speech — Use the cloned voice ID to synthesize speech from text.
Step 1: Clone a Voice
Endpoint
Audio Requirements
| Requirement | Value |
|---|---|
| Minimum duration | 10 seconds |
| Maximum file size | 20 MB |
| Supported formats | .mp3, .wav |
| Recommended | Clear speech, minimal background noise |
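The requirements above can be checked locally before spending a cloning request. The following is a minimal sketch, not part of the API: the function name `check_audio_sample` is hypothetical, "20 MB" is assumed to mean 20 × 1024 × 1024 bytes, and duration is measured only for uncompressed PCM `.wav` files because Python's standard library cannot read MP3 duration.

```python
import os
import wave

MAX_BYTES = 20 * 1024 * 1024          # assumption: MB means 1024 * 1024 bytes
MIN_SECONDS = 10.0                    # minimum duration from the table
ALLOWED_EXTENSIONS = {".mp3", ".wav"}  # supported formats from the table


def check_audio_sample(path):
    """Return a list of problems; an empty list means the local checks passed."""
    problems = []
    ext = os.path.splitext(path)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or '(no extension)'}")
    size = os.path.getsize(path)
    if size > MAX_BYTES:
        problems.append(f"file is {size} bytes, limit is {MAX_BYTES}")
    if ext == ".wav":
        # The wave module reads PCM WAV headers; other encodings raise wave.Error.
        with wave.open(path, "rb") as wav_file:
            duration = wav_file.getnframes() / wav_file.getframerate()
        if duration < MIN_SECONDS:
            problems.append(f"duration is {duration:.1f}s, minimum is {MIN_SECONDS}s")
    return problems
```

A passing result only means the file meets the documented limits; the service may still reject a sample for other reasons.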
Request
Headers:
| Header | Required | Description |
|---|---|---|
| Authorization | Yes | Bearer YOUR_API_KEY |
| Content-Type | Yes | application/json |
Body:
| Field | Required | Type | Description |
|---|---|---|---|
| audio_url | Yes | string | URL to the audio file (must be publicly accessible). |
| name | No | string | Name for the voice. Auto-generated if not provided. |
| noise_reduction | No | boolean | Enable noise reduction. Default: false. |
| volume_normalization | No | boolean | Normalize volume levels. Default: false. |
| accuracy | No | number | Text validation accuracy (0–1). Default: 0.8. |
| preview_text | No | string | Custom text for the preview audio clip. |
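As an illustration of the headers and body fields listed above, here is a minimal sketch that assembles a cloning request without sending it. The function name `build_clone_request` is hypothetical, and the endpoint URL is deliberately left out because it is not reproduced in this section; pass the returned headers and body to the HTTP client of your choice.

```python
import json


def build_clone_request(api_key, audio_url, name=None, noise_reduction=False,
                        volume_normalization=False, accuracy=0.8,
                        preview_text=None):
    """Return (headers, body_bytes) for a voice cloning request."""
    if not audio_url:
        raise ValueError("audio_url is required")
    if not 0 <= accuracy <= 1:
        raise ValueError("accuracy must be between 0 and 1")

    payload = {
        "audio_url": audio_url,
        "noise_reduction": noise_reduction,
        "volume_normalization": volume_normalization,
        "accuracy": accuracy,
    }
    # Optional string fields are omitted entirely when not provided.
    if name is not None:
        payload["name"] = name
    if preview_text is not None:
        payload["preview_text"] = preview_text

    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    return headers, json.dumps(payload).encode("utf-8")
```

Sending the documented defaults explicitly, as this sketch does, is a design choice for readability; omitting them should be equivalent if the defaults in the table hold.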
Response
Cost: $3.00 per voice.
Step 2: Generate Speech with a Cloned Voice
Endpoint
Headers
| Header | Required | Description |
|---|---|---|
| Authorization | Yes | Bearer YOUR_API_KEY |
| Content-Type | Yes | text/plain |
| X-Voice-ID | Yes | The voice ID from the cloning step. |
| X-Store-Audio | No | Set to true to store the audio and receive a URL. Default: false. |
Voice Settings (optional headers)
| Header | Range | Default | Description |
|---|---|---|---|
| X-Speed | 0.5–2.0 | 1 | Speech speed multiplier. |
| X-Volume | 0–10 | 1 | Volume level. |
| X-Pitch | -12 to 12 | 0 | Pitch adjustment in semitones. |
| X-Emotion | See below | — | Emotion of the generated speech. |
| X-English-Normalization | true/false | — | Improves how numbers and abbreviations are read aloud. |
Voice Modification (advanced, optional)
| Header | Range | Description |
|---|---|---|
| X-Voice-Modify-Pitch | -100 to 100 | Fine-grained pitch adjustment. |
| X-Voice-Modify-Intensity | -100 to 100 | Energy/intensity of the voice. |
| X-Voice-Modify-Timbre | -100 to 100 | Tonal quality adjustment. |
Audio Settings (optional)
| Header | Values | Default | Description |
|---|---|---|---|
| X-Sample-Rate | 8000, 16000, 22050, 24000, 32000, 44100 | 32000 | Sample rate in Hz. |
| X-Bitrate | 32000, 64000, 128000, 256000 | 128000 | Audio bitrate in bits per second. |
| X-Format | mp3, pcm, flac | mp3 | Output format. |
| X-Language-Boost | Language code | — | Enhance language recognition. |
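Because every setting travels as a header, it can help to validate values against the documented ranges before sending. The sketch below is one possible way to do that; `build_tts_headers` and its keyword names are hypothetical, only a subset of the optional headers is covered, and the service itself remains the authority on what it accepts.

```python
EMOTIONS = {"happy", "sad", "angry", "fearful", "disgusted", "surprised", "neutral"}
SAMPLE_RATES = {8000, 16000, 22050, 24000, 32000, 44100}
BITRATES = {32000, 64000, 128000, 256000}
FORMATS = {"mp3", "pcm", "flac"}


def _ranged(name, value, low, high):
    """Return value as a header string, or raise if outside [low, high]."""
    if not low <= value <= high:
        raise ValueError(f"{name} must be between {low} and {high}, got {value}")
    return str(value)


def _choice(name, value, allowed):
    """Return value as a header string, or raise if it is not an allowed value."""
    if value not in allowed:
        raise ValueError(f"{name} must be one of {sorted(allowed)}, got {value!r}")
    return str(value)


def build_tts_headers(api_key, voice_id, *, store_audio=False, speed=None,
                      volume=None, pitch=None, emotion=None,
                      sample_rate=None, bitrate=None, audio_format=None):
    """Build request headers; settings left as None fall back to server defaults."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "text/plain",
        "X-Voice-ID": voice_id,
    }
    if store_audio:
        headers["X-Store-Audio"] = "true"
    if speed is not None:
        headers["X-Speed"] = _ranged("X-Speed", speed, 0.5, 2.0)
    if volume is not None:
        headers["X-Volume"] = _ranged("X-Volume", volume, 0, 10)
    if pitch is not None:
        headers["X-Pitch"] = _ranged("X-Pitch", pitch, -12, 12)
    if emotion is not None:
        headers["X-Emotion"] = _choice("X-Emotion", emotion, EMOTIONS)
    if sample_rate is not None:
        headers["X-Sample-Rate"] = _choice("X-Sample-Rate", sample_rate, SAMPLE_RATES)
    if bitrate is not None:
        headers["X-Bitrate"] = _choice("X-Bitrate", bitrate, BITRATES)
    if audio_format is not None:
        headers["X-Format"] = _choice("X-Format", audio_format, FORMATS)
    return headers
```

The `X-Voice-Modify-*` headers could be added the same way with a range of -100 to 100.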
Body: Plain text to synthesize (max 5,000 characters).
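Text longer than the 5,000-character limit has to be sent as several requests. The following sketch splits on sentence boundaries where possible; `split_text` is a hypothetical helper, and it assumes the limit counts characters as Python's `len` does, which the source does not confirm.

```python
import re

MAX_CHARS = 5000  # body limit stated above


def split_text(text, limit=MAX_CHARS):
    """Split text into chunks of at most `limit` characters, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut at the limit.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip() if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be synthesized separately and the resulting audio concatenated on the client side.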
Supported Emotions
happy, sad, angry, fearful, disgusted, surprised, neutral
Special Text Features
Interjection tags — Insert natural speech sounds:
(laughs), (sighs), (coughs), (clears throat), (gasps), (sniffs), (groans), (yawns)
Pause markers — Insert precise pauses:
<#x#> where x is the duration in seconds (0.01–99.99). Example: Hello <#1.5#> world
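When text is assembled programmatically, small helpers can keep the tags and pause markers well-formed. This is a sketch with hypothetical helper names; it trims trailing zeros so that 1.5 seconds renders as in the example above, and it assumes whole-number durations such as `<#2#>` are accepted, which the source does not state.

```python
INTERJECTIONS = {
    "laughs", "sighs", "coughs", "clears throat",
    "gasps", "sniffs", "groans", "yawns",
}


def pause(seconds):
    """Return a pause marker for a duration between 0.01 and 99.99 seconds."""
    if not 0.01 <= seconds <= 99.99:
        raise ValueError("pause duration must be between 0.01 and 99.99 seconds")
    # Two decimals at most, with trailing zeros and a dangling dot removed.
    formatted = f"{seconds:.2f}".rstrip("0").rstrip(".")
    return f"<#{formatted}#>"


def interjection(name):
    """Return an interjection tag such as (sighs) for a documented sound."""
    if name not in INTERJECTIONS:
        raise ValueError(f"unknown interjection: {name!r}")
    return f"({name})"


text = f"Hello {pause(1.5)} world {interjection('laughs')}"
```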
Pricing
$0.08 per 1,000 characters ($80 per 1M characters).
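To budget a job in advance, the per-character rate can be combined with the one-time cloning fee from Step 1. The sketch below uses `Decimal` to avoid floating-point drift; the function names are hypothetical, and it assumes linear billing with no rounding or minimum charge, which the source does not specify.

```python
from decimal import Decimal

PRICE_PER_1K_CHARS = Decimal("0.08")  # speech generation rate stated above
CLONE_PRICE = Decimal("3.00")         # per-voice cloning cost from Step 1


def estimate_tts_cost(char_count):
    """Estimated generation cost in USD for a number of characters."""
    return PRICE_PER_1K_CHARS * Decimal(char_count) / Decimal(1000)


def estimate_total_cost(char_count, new_voices=0):
    """Estimated generation cost plus the fee for any newly cloned voices."""
    return estimate_tts_cost(char_count) + CLONE_PRICE * new_voices
```

For example, `estimate_total_cost(50_000, new_voices=1)` combines $4.00 of generation with the $3.00 cloning fee.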
Example
Managing Cloned Voices
List Your Cloned Voices
| Parameter | Type | Description |
|---|---|---|
| status | string | Filter: pending, ready, or failed. |
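A filtered listing request might be built as follows. This is a sketch under two assumptions the source does not confirm: that `status` is passed as a URL query parameter, and that the caller supplies the listing endpoint as `base_url` (the endpoint itself is not reproduced in this section). The name `list_voices_url` is hypothetical.

```python
from urllib.parse import urlencode

VALID_STATUSES = {"pending", "ready", "failed"}


def list_voices_url(base_url, status=None):
    """Return the listing URL, optionally filtered by voice status."""
    if status is None:
        return base_url
    if status not in VALID_STATUSES:
        raise ValueError(f"status must be one of {sorted(VALID_STATUSES)}")
    return f"{base_url}?{urlencode({'status': status})}"
```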
Voice Statuses
| Status | Description |
|---|---|
| pending | Voice is being processed (typically a few seconds). |
| ready | Voice is ready for TTS generation. |
| failed | Cloning failed. Check the error message. |
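Since a new voice starts out as pending, a client usually needs to wait for it to become ready before generating speech. The sketch below shows one polling approach; `wait_until_ready` is a hypothetical helper, and `fetch_status` is a caller-supplied function that returns the current status string for a voice ID, since the shape of the API response is not shown in this section.

```python
import time


def wait_until_ready(fetch_status, voice_id, timeout=60.0, interval=2.0,
                     sleep=time.sleep):
    """Poll until the voice is ready; raise on failure, unknown status, or timeout."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status(voice_id)
        if status == "ready":
            return status
        if status == "failed":
            raise RuntimeError(f"cloning failed for voice {voice_id}")
        if status != "pending":
            raise ValueError(f"unexpected status: {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"voice {voice_id} still pending after {timeout}s")
        sleep(interval)
```

The `sleep` parameter exists so the loop can be exercised in tests without real delays.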
Voice Expiration
Cloned voices expire after 7 days of inactivity. Verbatik automatically sends keep-alive requests for actively used voices. If a voice expires, you'll need to clone it again.
Uploading Audio
If your audio isn't hosted at a public URL, upload it first:
Upload the file as multipart form data. The response includes a URL that you can pass as audio_url in the cloning request.
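For clients without a multipart-capable HTTP library, the request body can be assembled by hand. The following is a sketch only: `build_multipart_upload` is a hypothetical helper, the form field name `file` is an assumption (the source does not name it), and the upload endpoint and any authorization header must be supplied by the caller.

```python
import mimetypes
import os
import uuid


def build_multipart_upload(path, field_name="file"):
    """Return (headers, body_bytes) for a single-file multipart/form-data upload."""
    boundary = uuid.uuid4().hex
    filename = os.path.basename(path)
    content_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    with open(path, "rb") as audio_file:
        data = audio_file.read()

    # One part: the part headers, a blank line, then the raw file bytes.
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; '
        f'filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")

    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return headers, head + data + tail
```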
Tips for Best Results
- Use clear, high-quality audio — Record in a quiet environment.
- Speak naturally — Natural speech patterns produce better clones.
- Provide at least 10 seconds — Longer samples (30–60 seconds) generally produce better results.
- Enable noise reduction — If your audio has background noise.
- Use volume normalization — Helps with inconsistent audio levels.