Voiceover Audio Clips

Level: Beginner

Language: Python

Gen AI API: OpenAI

Requirements:

Code: Voiceover Audio Clips GitHub Repository

Tutorial

Recording voiceovers for instructional and demo videos can take hours - and sometimes even days if you’re a professional at having to start over! But wouldn’t it be great if you could get the job done at 10x the speed and perfect on the first try? This sample provides the basic logic to convert text to speech using the OpenAI Audio API’s speech endpoint. Gone are the days of hours at the microphone!

Create a Script

The script itself is an essential piece of this project, given that you’ll pass the words from the script into the prompt for the model to generate an audio clip. My best recommendation is to organize your script so that it can be divided into small chunks of audio clips. For example, it’s significantly more taxing to edit an entire three-minute voiceover clip than six separate 30-second clips. Plus, the Audio API only allows a maximum of 4,096 characters per request.
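Since each request is capped at 4,096 characters, you may want a small helper to do the chunking for you. Here’s a minimal sketch (chunk_script is a name I’ve made up; it isn’t part of the sample repo) that splits on sentence boundaries:

```python
import re

# Hypothetical helper (not part of the sample repo): split a script into
# chunks that each fit the Audio API's 4,096-character limit, breaking on
# sentence boundaries. A single sentence longer than the limit is kept
# whole, so keep your sentences reasonably short.
def chunk_script(script: str, limit: int = 4096) -> list[str]:
    sentences = re.split(r"(?<=[.!?])\s+", script.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Start a new chunk if adding this sentence would blow past the limit
        if current and len(current) + 1 + len(sentence) > limit:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk then becomes one API request - and one editable audio clip.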

Also, the model is going to “convert” the written words into speech based on how they’re written. Therefore, wherever appropriate, consider writing the script in a way that phonetically spells out what you want the AI to say. For example, my last name (i.e. Speight) is often mispronounced. If I had to include my last name in the script, I might write “Spate” given that it’s almost a foolproof way to avoid a mispronunciation.

Create a Virtual Environment

We’ll avoid installing everything globally and instead do our installations in a virtual environment. Enter the following commands in the terminal:

Mac

pip install virtualenv
python3 -m venv venv
source venv/bin/activate

PC

pip install virtualenv
python -m venv venv
.\venv\Scripts\activate

After activating the virtual environment, install the following libraries:

pip install python-dotenv
pip install openai

Create the Python file

Now that we have our virtual environment activated and packages installed, we’ll create the Python file for our code. Name the file voiceover.py and add the following import statements:

import os
from pathlib import Path

from dotenv import load_dotenv
from openai import OpenAI

Get Your API Keys

Switch over to the browser and retrieve your API key from OpenAI.

Create Environment Variables

Rather than hardcode your API key into the code, we’ll create a .env file to store the key. The environment variable for OpenAI is pretty straightforward: it’s just your API key.

OPENAI_API_KEY=<YOUR API KEY>

Load the Environment Variable

In the voiceover.py file, load the environment variables so that your code can access the OpenAI key.

load_dotenv()

Create Variables

Let’s create a client variable and assign it an OpenAI() client instance, which automatically reads the API key from your environment. We’ll need this later in our code!

client = OpenAI()

Set the File Path and Name for the File

We’re gonna need a place to save the file once the API returns the response. We’ll save the file in the same folder as the project. As for the file name, I’d suggest using a naming convention that’ll make the editing process as smooth as possible for when you go to use the audio clips for their intended purpose. For example, if I were to generate audio for a tutorial that explains how to create a virtual environment, I would name my clip create-venv-1 followed by the extension. With each iteration of clips that the model generates, I’d manually increase the number at the end of the file name.
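If you’d rather not bump that trailing number by hand, a small helper can find the next free name in the sequence. This is just a sketch - next_clip_path is a hypothetical helper, not part of the sample code:

```python
from pathlib import Path

# Hypothetical helper: return the next unused path in a numbered
# sequence like create-venv-1.aac, create-venv-2.aac, ...
def next_clip_path(folder: Path, stem: str, ext: str = ".aac") -> Path:
    n = 1
    while (folder / f"{stem}-{n}{ext}").exists():
        n += 1
    return folder / f"{stem}-{n}{ext}"

# Example: speech_file_path = next_clip_path(Path(__file__).parent, "create-venv")
```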

As for the file extension, OpenAI provides six options: MP3 (default), Opus, AAC, FLAC, WAV, and PCM. The choice is really up to you! OpenAI offers some guidance, but I’d recommend considering the intended purpose of the audio clip before selecting the extension. For example, since I use this endpoint to generate audio clips for YouTube videos, I go with .aac.

speech_file_path = Path(__file__).parent / "name-your-file.aac"

Set Up the API Request

We’re going to use the speech endpoint to generate an audio clip based on the words in your script. There are a few parameters we’ll need to assign values to.

First up, we have the model. You can choose either tts-1 or tts-1-hd. I prefer to go with hd given that the quality is better, though keep in mind that tts-1 offers lower latency. Your use case may be OK with the non-hd model and that’s totally fine!

response = client.audio.speech.create(
  model="tts-1-hd"
)

Next, we need to choose the voice. OpenAI offers several voices to choose from. I personally like to use Alloy, but definitely consider giving the others a try.

response = client.audio.speech.create(
  model="tts-1-hd",
  voice="alloy"
)

The final parameter is the input. The input is where you’ll put the chunk of your script. Remember, the API only allows a maximum of 4,096 characters per request.

response = client.audio.speech.create(
  model="tts-1-hd",
  voice="alloy",
  input="This is where you place your script."
)

Call stream_to_file on the Response

The final step in our code is to call the stream_to_file method on the response object and pass in the file path for the generated audio clip. Your editor may give you a warning that stream_to_file is deprecated. I’m not entirely sure what else to put in its place given that when I tried different code, I got a lovely error.

response.stream_to_file(speech_file_path)
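For what it’s worth, recent versions of the openai Python SDK document a with_streaming_response variant that avoids the deprecation warning. Here’s a sketch, assuming a reasonably recent SDK; the guard keeps it from firing when no API key is configured:

```python
import os
from pathlib import Path

# Sketch of the non-deprecated pattern, assuming a recent openai SDK.
# Guarded so it only calls the API when a key is actually configured.
if os.getenv("OPENAI_API_KEY"):
    from openai import OpenAI

    client = OpenAI()
    speech_file_path = Path("name-your-file.aac")
    with client.audio.speech.with_streaming_response.create(
        model="tts-1-hd",
        voice="alloy",
        input="This is where you place your script.",
    ) as response:
        response.stream_to_file(speech_file_path)
```

If the deprecated call still works for you, there’s no urgent need to switch - but this is the direction the SDK is headed.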

Get Your Audio Clip!

Now that everything’s complete, run the code to generate your audio clip. You may find that you’ll want to tweak the script or maybe even try a different voice. OpenAI has noted that it’s currently not possible to control the model’s emotional range - which is a drag. Therefore, there’s no telling where the model may place more emphasis on some words over others.

But yanno what? That’s the beauty of generative AI!

Oh, and before I forget, make sure that you provide a clear disclosure to end users that the text-to-speech voice they’re hearing is AI-generated and not a human voice. This disclosure is part of OpenAI’s usage policies!
