
AI Speech-to-Speech Voice Conversion for Captain Vagor Using ElevenLabs: Demo 1

Updated: Sep 1

As artificial intelligence-driven media workflows become increasingly common, transparency around how these tools are used is essential. In this post, we'll look at how I used ElevenLabs' Speech-to-Speech AI voice technology in one of our recent content marketing videos.


Voice Model Design

Before this project, I had already created a custom AI voice model that matched our target creative criteria for Captain Vagor: deep resonance, aged vocal texture, slight inhuman timbral qualities, and consistent intelligibility. The model was built through text-based prompt engineering and iterative refinement in ElevenLabs’ Voice Design tool. It is now part of our standardized voice assets for the Lyone Tropic fictional universe and will be used across multiple content pieces to ensure character continuity.
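
For readers curious what that design step looks like in code, below is a minimal sketch of submitting a Voice Design prompt programmatically. The endpoint path, field names, and the example prompt are assumptions drawn from ElevenLabs’ public REST documentation at the time of writing; this is not the exact prompt we used for Vagor.

```python
# Hypothetical sketch: generating Voice Design previews from a text prompt
# via ElevenLabs' REST API. Endpoint path, field names, and the prompt are
# assumptions based on the public docs and may change.
import os
import requests

API_KEY = os.environ["ELEVENLABS_API_KEY"]  # your ElevenLabs API key

resp = requests.post(
    "https://api.elevenlabs.io/v1/text-to-voice/create-previews",
    headers={"xi-api-key": API_KEY},
    json={
        # The creative criteria, expressed as a design prompt.
        "voice_description": (
            "Deep, resonant male voice with an aged vocal texture and a "
            "slightly inhuman timbre; consistently clear and intelligible."
        ),
        # Sample line used to audition each generated preview.
        "text": (
            "I am Captain Vagor. This vessel answers to me alone, and every "
            "soul aboard her sails under my command until the stars burn out."
        ),
    },
    timeout=60,
)
resp.raise_for_status()

# Each preview carries a generated_voice_id that can be promoted to a
# permanent voice once a candidate matches the design criteria.
for preview in resp.json()["previews"]:
    print(preview["generated_voice_id"])
```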


Source Audio Recording

The source audio for this demo was recorded as part of a short-form vertical video production. I performed the original recording myself, delivering Vagor’s dialogue in Modern Albbonyan (English) with an artificial accent designed to mimic how non-native speakers from Jaxol and Eave sound when speaking the language.


We captured the recording with a solid-state condenser microphone connected to a 24-bit / 192 kHz audio interface. No processing was applied before the AI conversion: the captured audio remained raw, with no equalization, compression, noise reduction, or dynamic range adjustment.
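
As a quick sanity check before conversion, the capture format can be verified programmatically. A minimal sketch using the Python soundfile library (the filename is a placeholder):

```python
# Minimal sketch: confirming a capture is untouched 24-bit / 192 kHz PCM
# before uploading it for conversion. Filename is a placeholder.
import soundfile as sf

info = sf.info("vagor_source_take.wav")
print(info.samplerate, info.subtype)  # expect 192000 and 'PCM_24'
assert info.samplerate == 192000 and info.subtype == "PCM_24"
```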


AI Speech-to-Speech Conversion Process

Speech-to-speech voice conversion is distinct from traditional text-to-speech synthesis. Instead of generating speech from written text, the system takes a human speech recording as input, extracts its prosodic features, timing, and expression, and re-renders the acoustic output using the target voice model’s spectral characteristics.
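
To make that distinction concrete, here is an illustrative sketch of the analysis half of the idea: extracting pitch and energy contours (the core of prosody) from a source recording using the librosa library. This is not ElevenLabs’ actual pipeline, the filename is a placeholder, and the re-rendering stage is represented only by a closing comment.

```python
# Illustrative sketch of the analysis stage of speech-to-speech conversion:
# extract prosodic features (pitch and energy over time) from the source
# recording. Not ElevenLabs' implementation; filename is a placeholder.
import librosa
import numpy as np

y, sr = librosa.load("vagor_source_take.wav", sr=22050)

# Fundamental frequency contour (pitch over time) -- the core of prosody.
f0, voiced_flag, _ = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C5"), sr=sr
)

# Frame-level energy contour -- captures stress and dynamics.
energy = librosa.feature.rms(y=y)[0]

print(f"{np.count_nonzero(voiced_flag)} voiced frames, "
      f"median F0 {np.nanmedian(f0):.1f} Hz")

# In a full conversion system, contours like these (plus timing and phonetic
# content) would condition a neural model that re-renders the audio with the
# target voice's spectral characteristics.
```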


In technical terms, ElevenLabs Speech-to-Speech employs deep neural networks trained to perform high-fidelity voice conversion. It preserves the original temporal structure and emotional delivery while performing voice identity substitution. The model uses techniques such as autoregressive modeling, prosody transfer, and neural vocoding to produce realistic speech in the target voice.


For this video, we uploaded the raw human recording directly into the ElevenLabs platform, selected the pre-designed Vagor voice model, and ran the speech-to-speech conversion pipeline.
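
For anyone who wants to script that same step, here is a minimal sketch of the upload using ElevenLabs’ speech-to-speech REST endpoint. The voice ID, model ID, and filenames are placeholders, and the endpoint details reflect the public documentation at the time of writing; the published video used a WAV export rather than the default MP3 stream shown here.

```python
# Minimal sketch: running a raw recording through ElevenLabs speech-to-speech
# conversion via the REST API. Voice ID, model ID, and filenames are
# placeholders; endpoint details reflect the public docs at time of writing.
import os
import requests

API_KEY = os.environ["ELEVENLABS_API_KEY"]
VAGOR_VOICE_ID = "your-vagor-voice-id"  # the pre-designed voice model

with open("vagor_source_take.wav", "rb") as src:
    resp = requests.post(
        f"https://api.elevenlabs.io/v1/speech-to-speech/{VAGOR_VOICE_ID}",
        headers={"xi-api-key": API_KEY},
        data={"model_id": "eleven_multilingual_sts_v2"},
        files={"audio": src},
        timeout=300,
    )
resp.raise_for_status()

# The response body is the converted audio; write it out for the edit.
with open("vagor_converted.mp3", "wb") as out:
    out.write(resp.content)
```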


Output Handling

The output of the speech-to-speech conversion was exported in WAV format. The voice used in the final published social media video was the direct speech-to-speech output with no additional EQ, compression, reverb, or enhancement. This allows viewers and listeners to hear an unaltered example of the current state of ElevenLabs Speech-to-Speech technology.


Performance Evaluation

From a technical perspective, the speech-to-speech conversion process met key quality benchmarks.


First, it accurately preserved the timing, phrasing, and expressive dynamics of the original performance. Prosodic alignment between my source recording and the AI output was extremely close.
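
That closeness can be spot-checked rather than simply asserted. One rough approach, sketched below, is to correlate the pitch contours of the source and converted takes; filenames are placeholders, and this is my own illustrative check, not an ElevenLabs metric.

```python
# Rough sketch: quantifying prosodic alignment by correlating the pitch
# contours of the source and converted recordings. Filenames are
# placeholders; this is an illustrative check, not an official metric.
import librosa
import numpy as np

def f0_contour(path, sr=22050):
    """Load audio and return its frame-level F0 contour in Hz."""
    y, _ = librosa.load(path, sr=sr)
    f0, _, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C5"), sr=sr
    )
    return f0

src = f0_contour("vagor_source_take.wav")
out = f0_contour("vagor_converted.wav")

# Compare over the overlapping frames where both contours are voiced.
n = min(len(src), len(out))
voiced = ~np.isnan(src[:n]) & ~np.isnan(out[:n])
r = np.corrcoef(src[:n][voiced], out[:n][voiced])[0, 1]
print(f"F0 contour correlation: {r:.3f}")  # closer to 1.0 = tighter alignment
```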


Second, the timbral transformation was complete and consistent with the designed Captain Vagor voice model. The system effectively altered the spectral envelope, harmonic content, and vocal tract characteristics to match the AI voice identity without distorting intelligibility or introducing phonetic errors. There was no evidence of voice blending or identity leakage from the source speaker. The resulting audio maintained a consistent voice signature across multiple phrases and tone shifts.
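
The identity shift can be sanity-checked numerically as well. The sketch below compares time-averaged MFCC vectors, a coarse proxy for the spectral envelope, between source and output; again, the filenames are placeholders and the interpretation is informal.

```python
# Rough sketch: confirming the timbral transformation by measuring the
# distance between mean MFCC vectors (a coarse spectral-envelope proxy)
# of the source and converted audio. Filenames are placeholders.
import librosa
import numpy as np

def mean_mfcc(path, sr=22050):
    """Return the time-averaged MFCC vector of a recording."""
    y, _ = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20).mean(axis=1)

a = mean_mfcc("vagor_source_take.wav")
b = mean_mfcc("vagor_converted.wav")

# Cosine distance: near 0 would suggest identity leakage from the source;
# a clearly larger value indicates the spectral envelope was transformed.
cos_dist = 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"Timbral cosine distance: {cos_dist:.3f}")
```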


Third, we observed no aliasing artifacts, pitch instability, or transient smearing. Common failure modes in legacy or lower-resolution voice conversion systems (such as robotic modulation, prosody drift, or envelope flattening) did not occur. ElevenLabs' neural vocoding performed at a quality sufficient for direct use in commercial media.


Fourth, the system’s robustness to raw input was evident. Even without normalization, gating, or denoising, the model preserved speech integrity without amplifying room tone or background noise. This confirms the model’s resilience and adaptability to professionally recorded but untreated source material.


Use Case Transparency

My goal in documenting this process is to provide a transparent look at how AI voice tools are actively being used in creative pipelines. There is often a misconception that AI-generated voices are either entirely automated or entirely artificial. In reality, modern pipelines like ours involve a hybrid process: creative voice design using prompt engineering, human performance capture using high-fidelity gear, and speech-to-speech transformation via deep learning inference.


In this example, the final voice result was not written as text. It originated from my voice, which was then mapped onto a digital character voice using a purpose-built model. The AI's role was transformational, not generative in the conventional sense. This enables a level of nuance and expressiveness that is difficult to achieve with text-to-speech alone, while retaining full control over vocal identity and consistency.


AI Speech-to-Speech Production Benefits

For production environments focused on narrative consistency and rapid iteration, speech-to-speech workflows offer distinct advantages:


  • Character Voice Consistency: Once a voice model is designed, it can be used across many scenes and projects, regardless of which actor records the base dialogue. This decouples vocal identity from vocal performance.

  • Performance Flexibility: Directors can cast performers based on their emotional range and dramatic timing, rather than a vocal match to a character. AI handles the identity transformation without limiting performance dynamics.

  • Post-Production Efficiency: Because the system operates effectively on raw audio and delivers clean output, it eliminates large portions of the traditional voice processing chain. Turnaround times are significantly reduced.

  • Localization and Reusability: Future voiceovers or translations can reuse the same model, preserving continuity across languages or formats without requiring re-recording from the voice actor or a complete retuning of the model.


Conclusion

This voice demo demonstrates the practical use of speech-to-speech voice conversion as a production tool. We recorded clean, high-resolution human speech using a solid-state condenser microphone and a 24-bit / 192kHz interface. That speech was then transformed using ElevenLabs' Speech-to-Speech tool, applied to a pre-designed voice model for Captain Vagor, with no additional post-processing.


By publishing both the original recording and the synthesized output, we aim to set a standard for transparency in the use of AI-generated voices. I believe open documentation of our methods is essential for building trust in the tools we use and advancing best practices in the creative AI field.


Future demos will explore similar workflows using alternative models, performance conditions, and character voices from the Lyone Tropic fictional universe.
