Introduction
In this blog post, we will build a YouTube video translator application in Python, integrating the OpenAI and ElevenLabs APIs for language translation. We will walk through the process step by step: downloading the audio, transcribing it, translating the text, generating speech, and merging the result back into the video. Get ready to break down language barriers and unlock a world of multilingual communication possibilities.
Fact: an estimated 80% of computer-stored information is in English, while only about 17% of the world's population understands it. Translation is a crucial breakthrough that bridges this language gap.
Prerequisites:
Before diving into the code, there are a few prerequisites you need to fulfill:
- Python: Ensure that Python is installed on your system. You can download and install Python from the official Python website.
- OpenAI API Key: Sign up for an OpenAI account and obtain your API key. https://platform.openai.com/signup
- Elevenlabs API Key: Sign up for an Elevenlabs account and obtain your API key. https://beta.elevenlabs.io/sign-up
- Install Required Libraries: Install the necessary Python libraries by running the following command in your terminal or command prompt:
pip install requests beautifulsoup4 openai yt-dlp
- Install ffmpeg using command:
sudo apt-get install ffmpeg
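Before running anything, it can help to verify the prerequisites programmatically. Below is a small sketch that reports missing Python packages and missing command-line binaries; the helper name check_prereqs is my own, not part of any library.

```python
import importlib.util
import shutil

def check_prereqs(packages, binaries):
    """Return a list of missing Python packages and shell binaries."""
    missing = [p for p in packages if importlib.util.find_spec(p) is None]
    missing += [b for b in binaries if shutil.which(b) is None]
    return missing

# For this project you would call:
# check_prereqs(["requests", "openai", "yt_dlp"], ["ffmpeg"])
```

An empty list means everything is installed; otherwise the returned names tell you exactly what to install.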
Building the App:
Step 1: Importing Libraries and Setting Up Constants
import os
import requests
import openai
from yt_dlp import YoutubeDL
YOUTUBE_URL = '' # replace with the URL of the YouTube video to translate
INPUT_LANGUAGE = 'en' # English
OUTPUT_LANGUAGE = 'hi' # Hindi
OUTPUT_AUDIO_VOICE_GENDER = 'male' # male/female
TEMP_INPUT_AUDIO = 'temp_folder/input_audio.m4a'
TEMP_OUTPUT_AUDIO = 'temp_folder/output_audio.m4a'
TEMP_INPUT_VIDEO = 'temp_folder/input_video.mp4'
TEMP_OUTPUT_VIDEO = 'temp_folder/output_video.mp4'
OPEN_AI_KEY = "" # replace with OpenAI key
ELEVEN_LABS_KEY = "" #replace with ElevenLabs key
Remember to replace both OPEN_AI_KEY and ELEVEN_LABS_KEY with your actual keys.
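Hardcoding keys in source code is risky. As an alternative sketch, you can read them from environment variables; the variable names OPENAI_API_KEY and ELEVEN_LABS_API_KEY below are my own convention, not something either SDK requires.

```python
import os

def load_api_keys():
    """Read the API keys from environment variables instead of hardcoding them."""
    openai_key = os.environ.get("OPENAI_API_KEY", "")
    eleven_key = os.environ.get("ELEVEN_LABS_API_KEY", "")
    if not openai_key or not eleven_key:
        raise RuntimeError("Set OPENAI_API_KEY and ELEVEN_LABS_API_KEY first")
    return openai_key, eleven_key

# OPEN_AI_KEY, ELEVEN_LABS_KEY = load_api_keys()
```

This keeps the keys out of version control and lets you rotate them without touching the code.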
Step 2: Extract the Audio from the Video using yt-dlp
We use the yt-dlp library to download the audio from a YouTube video. You can customize the format and postprocessors options according to your needs.
ydl_opts = {
    'format': 'm4a/bestaudio/best',
    'paths': {'home': TEMP_INPUT_AUDIO.split('/')[0]},
    'outtmpl': {'default': TEMP_INPUT_AUDIO.split('/')[1]},
    'postprocessors': [{
        'key': 'FFmpegExtractAudio',
        'preferredcodec': 'm4a',
    }]
}

with YoutubeDL(ydl_opts) as ydl:
    error_code = ydl.download([YOUTUBE_URL])

if error_code != 0:
    raise Exception('Failed to download video')
The downloaded audio will be stored at the local path defined by the constant TEMP_INPUT_AUDIO.
Step 3: Transcribing Speech using OpenAI's Whisper ASR
Transcribing the audio in its original language is recommended for better results; a mismatched language setting may lead to unusual outcomes. I have tuned the parameters below for reference. For more detailed information, please refer to OpenAI's create-transcription documentation.
openai.api_key = OPEN_AI_KEY

transcription_text = openai.Audio.transcribe(
    model="whisper-1",
    file=open(TEMP_INPUT_AUDIO, 'rb'),  # the API expects a file object, not a path
    response_format='text',
    language=INPUT_LANGUAGE,
    temperature=0.3,
    prompt="Please try to match the speed and transcribe in English"
)
Step 4: Translate text using OpenAI's Completion API
Use the transcription_text obtained previously to create a prompt.
Prompts are crucial as they help the model understand the specific requirements for the output. Feel free to experiment and adjust the prompt as needed.
if OUTPUT_AUDIO_VOICE_GENDER == "male":
    prompt = ("Below text is transcribed from a YouTube video. "
              "Please translate it to Hinglish in Hindi text, as a male speaker.")
else:
    prompt = ("Below text is transcribed from a YouTube video. "
              "Please translate it to Hinglish in Hindi text, as a female speaker.")

prompt += "\n\n{}".format(transcription_text)
Pass the prompt to OpenAI's Completion API:
response = openai.Completion.create(
    engine='text-davinci-003',
    prompt=prompt,
    max_tokens=3000,
    temperature=0.2,
    n=1,
    stop=None,
)

translated_text = response.get('choices')[0]['text']
Note: you can replace this step with another translation API if you prefer. I tried the Google Translate API, but its translation sounded robotic and lacked naturalness in the final output.
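One caveat: OpenAI has since retired text-davinci-003. If the Completion call above stops working, the same prompt can be sent to the chat endpoint instead. The helper below is a sketch in the pre-1.0 openai SDK style used throughout this post; build_chat_request is my own name, not an SDK function.

```python
def build_chat_request(prompt):
    """Package the translation prompt for the chat completion endpoint."""
    return {
        "model": "gpt-3.5-turbo",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 3000,
        "temperature": 0.2,
    }

# Usage with the pre-1.0 openai SDK (matching the rest of this post):
# response = openai.ChatCompletion.create(**build_chat_request(prompt))
# translated_text = response["choices"][0]["message"]["content"]
```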
Step 5: Text to speech using ElevenLabs API
ElevenLabs offers a Voice Cloning feature that lets you clone your own voice. For this project, however, we will use the pre-existing voices Arnold and Elli, selecting the corresponding voice ID based on the OUTPUT_AUDIO_VOICE_GENDER variable.
if OUTPUT_AUDIO_VOICE_GENDER == "male":
    voice_id = 'VR6AewLTigWG4xSOukaG'  # Arnold
else:
    voice_id = 'MF3mGyEYCl7XYWbV9V6O'  # Elli
Call the ElevenLabs API with the translated_text along with the other parameters:
url = "https://api.elevenlabs.io/v1/text-to-speech/{voice_id}".format(voice_id=voice_id)

headers = {
    "Accept": "audio/mpeg",
    "Content-Type": "application/json",
    "xi-api-key": ELEVEN_LABS_KEY
}

data = {
    "text": translated_text,
    "model_id": "eleven_multilingual_v1",
    "voice_settings": {
        "stability": 0.5,
        "similarity_boost": 0.5
    }
}

response = requests.post(url, json=data, headers=headers)
response.raise_for_status()  # fail early if the API returned an error

with open(TEMP_OUTPUT_AUDIO, 'wb') as f:
    for chunk in response.iter_content(chunk_size=1024):
        if chunk:
            f.write(chunk)
We now have the processed audio at the TEMP_OUTPUT_AUDIO path.
Final Step: Merge the processed Audio & Video using ffmpeg
Just to clarify: in Step 2 we downloaded only the audio file from the YouTube link. Here we download the video stream without any audio, again using yt-dlp:
with YoutubeDL({
    'format': 'bestvideo[height<=480][ext=mp4]/best[ext=mp4][height<=480]',
    'outtmpl': TEMP_INPUT_VIDEO,
}) as ydl:
    ydl.download([YOUTUBE_URL])
In the final step, attach the processed audio to the downloaded video.
ffmpeg_command = 'ffmpeg -i "{video_path}" -i "{audio_file}" -c:v copy -map 0:v:0 -map 1:a:0 -shortest "{output_path}"'.format(
    video_path=TEMP_INPUT_VIDEO,
    audio_file=TEMP_OUTPUT_AUDIO,
    output_path=TEMP_OUTPUT_VIDEO,
)

os.system(ffmpeg_command)
Here's a breakdown of the command:
- -i "{video_path}" specifies the input video file path.
- -i "{audio_file}" specifies the input audio file path.
- -c:v copy copies the video stream without re-encoding.
- -map 0:v:0 selects the first video stream from the first input file.
- -map 1:a:0 selects the first audio stream from the second input file.
- -shortest ensures that the output duration matches the shortest input stream.
- "{output_path}" specifies the output file path.
- os.system(ffmpeg_command) executes the command in the operating system's shell.
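As a sketch of an alternative, the same command can be built as an argument list and run through subprocess.run, which avoids the shell-quoting pitfalls os.system has with paths containing spaces. The helper name build_ffmpeg_cmd is my own, not an ffmpeg or stdlib function.

```python
import subprocess

def build_ffmpeg_cmd(video_path, audio_path, output_path):
    """Build the argv list for the merge; no shell quoting needed."""
    return [
        "ffmpeg",
        "-i", video_path,   # input video
        "-i", audio_path,   # input audio
        "-c:v", "copy",     # copy the video stream without re-encoding
        "-map", "0:v:0",    # first video stream of input 0
        "-map", "1:a:0",    # first audio stream of input 1
        "-shortest",        # match the shortest input's duration
        output_path,
    ]

# subprocess.run(build_ffmpeg_cmd(TEMP_INPUT_VIDEO, TEMP_OUTPUT_AUDIO,
#                                 TEMP_OUTPUT_VIDEO), check=True)
```

With check=True, subprocess.run also raises an exception if ffmpeg fails, instead of silently returning a nonzero exit code.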
Hurray! We now have the final video 🎉
Limitations:
- The output audio does not sync second by second with the original audio, and there is no lip-syncing.
- All background music is removed in the final output.
- ElevenLabs supports a limited set of languages, including English, French, German, Hindi, Italian, Polish, Portuguese, and Spanish.
PS: This project was part of the Hackathon at Kuku FM, and we snagged the second spot!
For any doubts or edits, please feel free to reach out to me at hello@ratrey.in