How to Create Custom Language Learning Audio with AI
There are a lot of great language learning audio resources out there. Here are a few that I use for Japanese:
- Pimsleur
- Nihongo con Teppei (free)
- Japanese with Shun (free)
- 2000 Most Common Japanese Words in Context
Sometimes, however, I want custom audio. There are topics I want to be able to talk about and specific sentences I want to be able to say. Having an audio file to drill me on these is helpful.
This is my method of creating custom audio for Japanese:
First, download VOICEVOX. This is a free AI text-to-speech program that runs on-device and produces good Japanese audio from Japanese text. I use the CPU-only version because I create these files on my laptop, which does not have a powerful GPU. All of the program menus are in Japanese, so I use my phone's camera translation to navigate them.
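Once VOICEVOX is installed and running, you can confirm that its local API is reachable and see which voices you have with a few lines of Python. This is a quick sketch assuming a default local install serving on port 50021; the IDs it prints are what the script further down uses as JP_SPEAKER_ID.

```python
# Sketch: list VOICEVOX voices and their style IDs (default local install assumed).
import requests

resp = requests.get("http://127.0.0.1:50021/speakers", timeout=5)
resp.raise_for_status()
for speaker in resp.json():
    for style in speaker["styles"]:
        print(f'{style["id"]:>4}  {speaker["name"]} ({style["name"]})')
```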
If you are learning a different language, you will have to search around for a different program. ElevenLabs has great voices for a lot of languages. You can ask an AI to adapt the code below to use it.
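If you go that route, the swap is mostly confined to the synthesis function. Below is a rough sketch of the shape it might take, not a drop-in replacement: the endpoint and model name reflect ElevenLabs' v1 API as I last saw it, and YOUR_VOICE_ID / YOUR_API_KEY are placeholders, so check their current docs before relying on any of it.

```python
# Hypothetical sketch of an ElevenLabs text-to-speech call (verify against current docs).
import requests

def elevenlabs_tts(text: str) -> bytes:
    resp = requests.post(
        "https://api.elevenlabs.io/v1/text-to-speech/YOUR_VOICE_ID",  # placeholder voice ID
        headers={"xi-api-key": "YOUR_API_KEY"},  # placeholder API key
        json={"text": text, "model_id": "eleven_multilingual_v2"},  # model name may have changed
        timeout=30,
    )
    resp.raise_for_status()
    return resp.content  # audio bytes (MP3 by default) that pydub can load
```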
Once you have figured out which program you will use to generate audio from text, you can use ChatGPT or Claude to create the text in your target language. Type in the sentences you want to be able to say, or ask the model to suggest sentences appropriate to certain topics. I usually provide a few sentences and ask the model to generate more in the same vein.
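If you want a starting point, a prompt along these lines works well (the sentences here are just an example): "Here are some Japanese sentences I want to practice: お元気ですか？ / 本当ですか？ / 手伝ってもらえますか？ Please write ten more short, polite-form sentences on everyday topics, each on its own line with an English translation."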
Next, I edit the Python script below in my IDE to add my custom sentences. The VOICEVOX API address used here is the default for a local installation. You can give this code to the AI you're using and ask it to edit the sentences for you, or to give you the sentences formatted correctly to paste into the code. You can also ask an AI to adapt the code for Mac or Linux if you are not using Windows.
This code contains some options you can change, like voice selection, file output format, and whether or not to include beeps. The script performs some basic audio equalization and includes example beginner sentences that you can change out.
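For example, setting BEEP_MS = 300 adds a short separator tone between pairs, EXPORT_FORMAT = "wav" switches the output format, and JP_SPEAKER_ID picks the VOICEVOX voice (the listing snippet above shows which IDs are available).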
If you don't have Python, you will need to install it to use this script. If this is all new to you, ask an AI to walk you through it. This is an area where AIs are very well trained.
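If you're on Windows like me, winget can handle Python too; winget install Python.Python.3.13 should work, but run winget search python first to confirm the exact package ID.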
```python
# Windows-friendly: VOICEVOX (JP) + pyttsx3/SAPI (EN) + pydub normalize & combine
import os
import io
import time
import tempfile
import requests
from pydub import AudioSegment, effects, generators
import pyttsx3
from datetime import datetime
# =========================
# CONFIG
# =========================
VOICEVOX_HOST = "http://127.0.0.1:50021"
JP_SPEAKER_ID = 2 # Try different IDs (1, 2, 3...). Change to taste.
JP_SPEED_SCALE = 1.0 # 0.5 ~ 2.0 (VOICEVOX audio_query param)
JP_PITCH_SCALE = 0.0 # -0.15 ~ 0.15 typical
JP_INTONATION_SCALE = 1.0 # 0 ~ 2
JP_VOLUME_SCALE = 1.0 # 0 ~ 2
JP_PRE_PHONEME_LENGTH = 0.1
JP_POST_PHONEME_LENGTH = 0.1
EN_RATE_WPM = 165 # English speaking rate (SAPI via pyttsx3)
EN_VOLUME = 1.0 # 0.0 ~ 1.0
EN_VOICE_SUBSTR = "Zira" # which Windows voice to pick (substring match, e.g. "Zira", "en-US", "Guy")
TARGET_SAMPLE_RATE = 24000 # Final sample rate for consistency (VOICEVOX default is often 24000)
PAD_MS = 400 # silence between segments
BEEP_MS = 0 # separator beep length (set to 0 to disable beep)
BEEP_HZ = 660
EXPORT_FORMAT = "mp3" # "wav" or "mp3"
OUT_DIR = "output"
STAMP = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
OUT_FILE = f"jp_en_combo_{STAMP}.{EXPORT_FORMAT}"
# =========================
# Your bilingual lines
# =========================
# Each tuple: (Japanese text, English text)
LINES = [
("γε
ζ°γ§γγοΌ", "How are you?"),
("ε
ζ°γ§γγγγͺγγ―οΌ", "I am well. And yourself?"),
("ε
ζ°γ§γγγγγγ¨γγ", "I am well, thank you."),
("γ―γγγγ‘γγγ", "Yes, certainly."),
("γγγ―ι’η½γγ§γγγ", "That is interesting."),
("ζ¬ε½γ§γγοΌ", "Really?"),
("η΅Άε―Ύγ«γ", "Definitely."),
("γγεγγγΎγγγ", "I'm not sure."),
("γγΏγΎγγγγγε°γγγ£γγθ©±γγ¦γγγγΎγγοΌ", "Sorry, can you please speak more slowly?"),
("γγοΌγγγγ§γγγ", "Wow! That's great."),
# --- Everyday reactions ---
("γͺγγ»γ©γ", "I see."),
("εγγγΎγγγ", "I understand."),
("εγγγΎγγγ", "I don't understand."),
("γγγ§γγ", "That's right."),
("γγγγγͺγγ§γγ", "That's not right."),
("η§γγγζγγΎγγ", "I think so too."),
("γγΆγγ", "Maybe."),
# --- Polite interaction ---
("γγΏγΎγγγ", "Excuse me."),
("γγγγ¨γγγγγΎγγ", "Thank you very much."),
("γ©γγγγγΎγγ¦γ", "You're welcome."),
("γι‘γγγΎγγ", "Please."),
("ε€§δΈε€«γ§γγ", "It's okay."),
("γ―γγγΎγγ¦γ", "Nice to meet you."),
("γγγγͺγγ", "Goodbye."),
("γΎγγγ", "See you later."),
# --- Feelings & impressions ---
("ζ₯½γγγ§γγ", "That's fun."),
("γγγγγ§γγγ", "That's cute."),
("γγγγ§γγγ", "That's beautiful."),
("γγγγ§γγοΌ", "That's amazing!"),
("ε₯½γγ§γγ", "I like it."),
("ε₯½γγγγͺγγ§γγ", "I don't like it."),
("η²γγΎγγγ", "I'm tired."),
("γγͺγγγγγΎγγγ", "I'm hungry."),
("γγͺγγγγ£γ±γγ§γγ", "I'm full."),
# --- Practical daily use ---
("γγ€γ¬γ―γ©γγ§γγοΌ", "Where is the bathroom?"),
("γγγγ§γγοΌ", "How much is it?"),
("δ½ζγ§γγοΌ", "What time is it?"),
("ε°γ
γεΎ
γ‘γγ γγγ", "Please wait a moment."),
("εγγγΎγγγ", "I don't know."),
("ζδΌγ£γ¦γγγγΎγγοΌ", "Could you help me?"),
("ιγ«θΏ·γγΎγγγ", "I'm lost."),
]
# =========================
# Helpers
# =========================
def _voicevox_tts(text: str, speaker: int) -> AudioSegment:
"""Synthesize Japanese via VOICEVOX engine -> AudioSegment (mono, TARGET_SAMPLE_RATE)."""
    # 1) audio_query: VOICEVOX expects the text and speaker as query-string
    #    parameters (POST /audio_query?text=...&speaker=...), not a JSON body.
aq = requests.post(
f"{VOICEVOX_HOST}/audio_query",
params={"text": text, "speaker": speaker},
timeout=10
)
aq.raise_for_status()
query = aq.json()
# override query fields with our config if present
query["speedScale"] = JP_SPEED_SCALE
query["pitchScale"] = JP_PITCH_SCALE
query["intonationScale"] = JP_INTONATION_SCALE
query["volumeScale"] = JP_VOLUME_SCALE
query["prePhonemeLength"] = JP_PRE_PHONEME_LENGTH
query["postPhonemeLength"] = JP_POST_PHONEME_LENGTH
# 2) synthesis
syn = requests.post(
f"{VOICEVOX_HOST}/synthesis",
params={"speaker": speaker},
json=query,
timeout=30
)
syn.raise_for_status()
wav_bytes = syn.content
seg = AudioSegment.from_file(io.BytesIO(wav_bytes), format="wav")
seg = seg.set_channels(1).set_frame_rate(TARGET_SAMPLE_RATE)
return seg
def _pyttsx3_to_wav(text: str, rate_wpm: int, volume: float, voice_substr: str) -> AudioSegment:
"""Synthesize English via Windows SAPI using pyttsx3 -> AudioSegment."""
engine = pyttsx3.init()
# voice
chosen_voice_id = None
for v in engine.getProperty("voices"):
# Choose the first voice that contains the substring (e.g., 'en', 'Zira', 'en-US')
if voice_substr.lower() in (v.name + " " + v.id).lower():
chosen_voice_id = v.id
break
if chosen_voice_id:
engine.setProperty("voice", chosen_voice_id)
engine.setProperty("rate", rate_wpm)
engine.setProperty("volume", volume)
with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as tf:
tmp_path = tf.name
# Synthesize to file
engine.save_to_file(text, tmp_path)
engine.runAndWait()
engine.stop()
# Load and resample
seg = AudioSegment.from_file(tmp_path, format="wav")
seg = seg.set_channels(1).set_frame_rate(TARGET_SAMPLE_RATE)
# Clean up temp
try:
os.remove(tmp_path)
except OSError:
pass
return seg
def _normalize_peak(seg: AudioSegment, target_dbfs=-1.0) -> AudioSegment:
"""Peak normalize so that max peak ~ target_dbfs."""
change = target_dbfs - seg.max_dBFS
return seg.apply_gain(change)
def _separator_beep(ms=BEEP_MS, freq=BEEP_HZ) -> AudioSegment:
if ms <= 0:
return AudioSegment.silent(duration=0)
tone = generators.Sine(freq).to_audio_segment(duration=ms)
tone = tone.set_frame_rate(TARGET_SAMPLE_RATE).set_channels(1)
    # Gentle -6 dB so it's not harsh
tone = tone.apply_gain(-6.0)
return tone
# =========================
# Build the combined track
# =========================
def main():
os.makedirs(OUT_DIR, exist_ok=True)
    silence = AudioSegment.silent(duration=PAD_MS)
    beep = _separator_beep()  # separator tone; zero-length when BEEP_MS == 0
# Lead-in (keep OUTSIDE the loop)
combined = AudioSegment.silent(duration=400)
# --- Practice pause lengths (ms) ---
PRACTICE_PAUSE_MS_1 = 2500 # pause after first JP (for you to repeat)
PRACTICE_PAUSE_MS_2 = 2500 # pause after second JP (repeat again)
practice_pause_1 = AudioSegment.silent(duration=PRACTICE_PAUSE_MS_1)
practice_pause_2 = AudioSegment.silent(duration=PRACTICE_PAUSE_MS_2)
print(f"Synthesizing {len(LINES)} JPβEN pairs... Make sure VOICEVOX is running.")
for i, (jp_text, en_text) in enumerate(LINES, start=1):
print(f"[{i}] JP: {jp_text}")
jp_seg = _voicevox_tts(jp_text, JP_SPEAKER_ID)
jp_seg = _normalize_peak(jp_seg, -2.0)
print(f"[{i}] EN: {en_text}")
en_seg = _pyttsx3_to_wav(en_text, EN_RATE_WPM, EN_VOLUME, EN_VOICE_SUBSTR)
en_seg = _normalize_peak(en_seg, -2.0)
        # JP → pause → EN → JP (repeat) → pause → beep → small gap
        # Reuse the same jp_seg for the repeat (saves time and keeps prosody identical)
        pair_block = (
            jp_seg
            + practice_pause_1
            + silence
            + en_seg
            + silence
            + jp_seg  # repeat Japanese
            + practice_pause_2
            + beep  # separator beep between pairs (empty when BEEP_MS == 0)
            + silence
        )
combined += pair_block
print(f" pair {i} length: {len(pair_block)/1000:.2f}s | total: {len(combined)/1000:.2f}s")
time.sleep(0.05) # tiny yield to avoid TTS hiccups
# Final normalize & export
combined = effects.normalize(combined)
out_path = os.path.join(OUT_DIR, OUT_FILE)
if EXPORT_FORMAT.lower() == "mp3":
combined.export(out_path, format="mp3", bitrate="192k")
else:
combined.export(out_path, format="wav")
print(f"Done! Exported: {out_path}")
if __name__ == "__main__":
    main()
```
If you decide that you'd prefer different pause lengths or a different number of repetitions, you can edit the code yourself or ask an AI to do it.
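For example, here is one way to hear each Japanese sentence three times instead of twice. This is a sketch against the script above: it reuses the script's own jp_seg, en_seg, pause, and silence segments, and would replace the pair_block assignment in main().

```python
# Sketch: three Japanese repetitions per pair instead of two.
pair_block = (
    jp_seg + practice_pause_1 + silence      # JP, then a pause for you to repeat
    + en_seg + silence                       # English translation
    + jp_seg + practice_pause_1 + silence    # JP again, pause again
    + jp_seg + practice_pause_2 + silence    # third JP repetition
)
```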
There may be some Python packages used here that you need to install using the command line/terminal.
Package | Purpose | Install command
---|---|---
requests | Connects to the local VOICEVOX engine API (http://127.0.0.1:50021) | pip install requests
pydub | Handles audio combining, silence padding, normalization, export | pip install pydub
pyttsx3 | Uses Windows' built-in SAPI voices for English text-to-speech | pip install pyttsx3
ffmpeg (external tool) | Required by pydub to read/export MP3/WAV | install separately (see below)
I also had to install audioop-lts (pip install audioop-lts). Python 3.13 removed the standard-library audioop module that pydub still depends on, and audioop-lts restores it.
You could use this one-liner to make installing the Python packages fast: pip install requests pydub pyttsx3 audioop-lts
This is how I installed ffmpeg: winget install Gyan.FFmpeg
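After installing, open a new terminal and run ffmpeg -version to confirm it's on your PATH.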
Once you have the script, your sentences, and all the required tools ready to go, run the script in your terminal. Voila, custom audio!
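For example, if you saved the script as make_audio.py (the name is up to you), that means running python make_audio.py in the folder where you saved it.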
To make new audio, all you have to do is change the sentences in the script. I hope this is helpful to you. Enjoy!