AI Speech-to-Speech Voice Conversion for Captain Vagor Using ElevenLabs: Demo 2
- Rose Luk

- Jul 11
- 4 min read
Updated: Sep 1
This article provides a breakdown of how I generated Captain Vagor's voice in our AI short film Vessel of Mourning (Gen:48 Edition) using ElevenLabs' Speech-to-Speech technology. The purpose of this article is to provide transparency into my AI-assisted creative process and serve as an educational resource for filmmakers and audio professionals integrating speech-driven voice transformation into their workflows.
Table of Contents:
- Designing the Vagor Voice Model
- Speech Capture and Input Pipeline
- AI Speech-to-Speech Conversion Process
- Evaluation of AI Output
- Role of Post-Processing in the Final Film
- Conclusion
Designing the Vagor Voice Model
The final voice of Captain Vagor was not selected from ElevenLabs' default offerings; it was designed from scratch with the platform's Voice Design tool, which allows creators to engineer a voice from the ground up by entering descriptive prompts and refining the results through controlled iterations.
I began by generating initial voice candidates using targeted adjectives such as "aged," "gravelly," "ancient," and "resonant." Each generation was evaluated subjectively and through waveform analysis to determine which voices contained the spectral density and harmonic character that matched our creative intent.

I used Eleven Multilingual v2 rather than Eleven English v2 as the voice design base. This was a deliberate choice to ensure compatibility with our fictional language "Senguto". The multilingual model supports more flexible phoneme blending and cross-linguistic prosody, which was critical for capturing the phonotactics required by the character and the fictional world.

During the voice design process, I utilized ElevenLabs' three main Voice Design controls: Loudness, Quality, and Guidance Scale. Loudness was adjusted to maintain vocal presence without overpowering the mix. Quality was kept high to preserve upper-frequency textures critical to consonant intelligibility and subtle vocal fry. Guidance Scale was finely tuned to ensure a balance between prompt fidelity and creative variance across generations. The final Vagor voice was selected after about 4-5 iterations, each informed by previous outcomes and refined prompt inputs.
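For readers who would rather script this exploration than click through the web interface, here is a minimal sketch of what an equivalent Voice Design request might look like. I did all of my design work in the browser UI; the endpoint path and the field names mapping the Loudness, Quality, and Guidance Scale sliders (loudness, quality, guidance_scale) are assumptions based on my reading of ElevenLabs' public API documentation, so verify them against the current docs before relying on this.

```python
# Sketch only: the Vagor voice was designed in the ElevenLabs web UI.
# Endpoint path and field names are assumptions; confirm against current API docs.
import os
import requests

API_KEY = os.environ["ELEVENLABS_API_KEY"]  # hypothetical environment variable name

payload = {
    "voice_description": "aged, gravelly, ancient, resonant male sea captain",
    "text": "Sample line in Senguto used only for preview generation.",
    # UI sliders mapped to assumed API fields:
    "loudness": 0.4,        # vocal presence without overpowering the mix
    "quality": 0.9,         # preserve upper-frequency texture and vocal fry
    "guidance_scale": 5.0,  # balance prompt fidelity vs. creative variance
}

resp = requests.post(
    "https://api.elevenlabs.io/v1/text-to-voice/create-previews",
    headers={"xi-api-key": API_KEY},
    json=payload,
    timeout=60,
)
resp.raise_for_status()

# Assumed response shape: a list of previews, each with a generated voice ID.
for i, preview in enumerate(resp.json().get("previews", [])):
    print(i, preview.get("generated_voice_id"))
```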
Speech Capture and Input Pipeline
We used a solid-state condenser microphone combined with a 24-bit / 192 kHz audio interface to record my original vocal performance. This configuration ensured high-resolution capture of my timing, dynamic range, and expressive nuance without introducing unwanted coloration. In that performance, I acted out how I wanted Vagor's voice to sound, modeling the intended delivery, accuracy, timing, and intention.
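Before uploading, it is worth verifying that the take really is what the interface captured. Below is a small pre-flight check, a sketch assuming the take was exported as a WAV file (the filename is hypothetical) and that the third-party soundfile package is installed.

```python
# Minimal pre-flight check on the raw performance take before uploading.
import soundfile as sf

info = sf.info("vagor_take_raw.wav")  # hypothetical filename

assert info.samplerate == 192_000, f"expected 192 kHz, got {info.samplerate} Hz"
assert info.subtype == "PCM_24", f"expected 24-bit PCM, got {info.subtype}"
print(f"{info.duration:.2f}s, {info.channels} channel(s), "
      f"{info.samplerate} Hz, {info.subtype}")
```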
AI Speech-to-Speech Conversion Process
I uploaded the raw recording into ElevenLabs' Speech-to-Speech interface and selected the custom-designed Vagor voice model. The ElevenLabs system extracted the prosodic structure and phonetic articulation from my performance and re-synthesized it using the spectral and temporal characteristics of the AI voice model.
Under the hood, ElevenLabs Speech-to-Speech employs autoregressive modeling and neural vocoding to re-render the waveform. It preserves timing and emotional contour while replacing the vocal signature with the target model. Unlike text-to-speech, which must generate both speech and timing from textual input, speech-to-speech preserves actor-driven rhythm, which is crucial for cinematic pacing.
The final audio output was exported directly without modification. This unprocessed audio is what listeners hear in the demo video. It represents a clean benchmark of the ElevenLabs output quality when applied to professionally recorded raw input and custom voice models.
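For completeness, here is a hedged sketch of the same conversion performed through the ElevenLabs API rather than the web interface, which is what I actually used. The voice ID, file names, and speech-to-speech model identifier below are placeholders or assumptions; consult the current API reference for the exact request format.

```python
# Sketch of the equivalent conversion via the API; the demo used the web interface.
# Voice ID, model ID, and file names are placeholders/assumptions.
import os
import requests

API_KEY = os.environ["ELEVENLABS_API_KEY"]
VAGOR_VOICE_ID = "YOUR_CUSTOM_VOICE_ID"  # the saved Voice Design result

with open("vagor_take_raw.wav", "rb") as f:
    resp = requests.post(
        f"https://api.elevenlabs.io/v1/speech-to-speech/{VAGOR_VOICE_ID}",
        headers={"xi-api-key": API_KEY},
        files={"audio": f},
        data={"model_id": "eleven_multilingual_sts_v2"},  # assumed STS model name
        timeout=300,
    )
resp.raise_for_status()

# Export directly with no further processing, mirroring the demo workflow.
with open("vagor_converted.mp3", "wb") as out:
    out.write(resp.content)
```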
Evaluation of AI Output
The AI-generated output was analyzed against several benchmarks:
- Fidelity: The voice was free of aliasing, transient smearing, or harmonic distortion. The noise floor was low, and no digital artifacts were present.
- Identity Consistency: The voice retained consistent timbral identity across multiple phrases, with no drift in formant structure or voicing.
- Prosodic Alignment: Timing and phrasing matched the human source precisely. Cadence and breath spacing were preserved without slurring or lag.
- Spectral Quality: Frequency analysis showed strong low-end presence with intelligible midrange and well-preserved high-end consonants.
These results confirmed that the Speech-to-Speech system was capable of transforming performance while maintaining professional-grade fidelity suitable for narrative short film use.
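The spectral and noise-floor benchmarks above can be spot-checked with a few lines of analysis code. The sketch below assumes the converted take was saved as a WAV file (the filename is hypothetical) and uses numpy and soundfile; it is a rough approximation of the checks I ran, not a calibrated measurement tool.

```python
# Rough spectral/noise-floor check of the converted take.
import numpy as np
import soundfile as sf

audio, sr = sf.read("vagor_converted.wav")  # hypothetical filename
if audio.ndim > 1:
    audio = audio.mean(axis=1)  # fold to mono for analysis

# Noise floor estimate: RMS of the quietest 10% of 50 ms windows, in dBFS.
win = int(0.05 * sr)
frames = audio[: len(audio) // win * win].reshape(-1, win)
rms = np.sqrt((frames ** 2).mean(axis=1))
floor_db = 20 * np.log10(np.percentile(rms, 10) + 1e-12)
print(f"estimated noise floor: {floor_db:.1f} dBFS")

# Long-term average spectrum: energy share in low / mid / high bands.
spectrum = np.abs(np.fft.rfft(audio)) ** 2
freqs = np.fft.rfftfreq(len(audio), 1 / sr)
for name, lo, hi in [("low", 20, 250), ("mid", 250, 4000), ("high", 4000, 12000)]:
    band = spectrum[(freqs >= lo) & (freqs < hi)].sum()
    print(f"{name:>4} band energy share: {band / spectrum.sum():.2%}")
```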
Role of Post-Processing in the Final Film
In Vessel of Mourning (Gen:48 Edition), Captain Vagor's AI-generated voice was integrated with the score, ambience, and environmental sound design. Mixing included EQ, compression, spatial reverberation, and automated gain riding to maintain vocal balance within complex scenes.
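To illustrate what automated gain riding does in this context, here is a deliberately crude sketch that nudges short windows of the dialogue toward a target RMS level. The actual film mix was done in a DAW with proper EQ, compression, and reverberation; the filename and target level below are hypothetical.

```python
# Illustrative only: a crude automated gain ride toward a target dialogue level,
# of the kind a DAW mixer performs far more gracefully. Not the actual film mix chain.
import numpy as np
import soundfile as sf

audio, sr = sf.read("vagor_converted.wav")  # hypothetical filename
if audio.ndim > 1:
    audio = audio.mean(axis=1)

target_db = -20.0      # hypothetical target short-term RMS level in dBFS
win = int(0.4 * sr)    # 400 ms analysis windows
gains = np.ones_like(audio)

for start in range(0, len(audio) - win, win):
    seg = audio[start:start + win]
    rms_db = 20 * np.log10(np.sqrt((seg ** 2).mean()) + 1e-12)
    gain_db = np.clip(target_db - rms_db, -6.0, 6.0)  # limit the correction range
    gains[start:start + win] = 10 ** (gain_db / 20)

sf.write("vagor_gain_rode.wav", audio * gains, sr)
```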
Isolating Vagor's AI-generated voice in the demo video provides a clean example of what ElevenLabs Speech-to-Speech technology can deliver without further processing. This supports reproducibility, technical benchmarking, and honest evaluation of the AI’s role in production.

Conclusion
Captain Vagor’s voice in Vessel of Mourning (Gen:48 Edition) is the result of a complete AI voice workflow: prompt-based voice identity design, human speech input, and transformation through ElevenLabs’ Speech-to-Speech tool. No post-processing was applied to either the input or the output in the video demo. The custom Vagor voice model was generated with Eleven Multilingual v2 and refined through manual prompt engineering and slider adjustments to fit a fictional linguistic and tonal environment.
As AI continues to expand the boundaries of character design and voice performance, I, along with the rest of the team at Lyone Media Group, remain committed to transparency and accountability. Speech-to-Speech voice conversion enables consistent character identity across productions while preserving human expressiveness. By documenting my methodology, I aim to set clear standards for ethical, high-fidelity use of AI in cinematic storytelling.
Future posts will explore additional characters, voice model design variations, and cross-modal integrations as part of our growing library of AI-powered cinematic tools.

