Experiments in Life

How to Create Custom Language Learning Audio with AI

There are a lot of great language learning audio resources out there, and I use several of them for Japanese.

Sometimes, however, I want custom audio. There are topics I want to be able to talk about and specific sentences I want to be able to say. Having an audio file to drill me on these is helpful.

This is my method of creating custom audio for Japanese:

First, download VOICEVOX. This is a free AI text-to-speech engine that runs on-device and produces good Japanese audio from Japanese text. I use the CPU-only version because I create these files on my laptop, which does not have a powerful GPU. All of the program menus are in Japanese, so I use my phone's camera translation to navigate them.
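Once it's installed, start the app and leave it running; the script below talks to the engine over a local HTTP API. Here is a quick sanity check that the engine is reachable, a minimal sketch assuming the default port of 50021:

import requests

# The VOICEVOX engine serves a local HTTP API while the app is running.
# This prints the engine version if everything is wired up.
print(requests.get("http://127.0.0.1:50021/version", timeout=5).text)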

If you are learning a different language, you will have to search around for a different program. ElevenLabs has great voices for a lot of languages. You can ask an AI to adapt the code below to use it.
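For instance, here is a rough sketch of what a synthesis function might look like with ElevenLabs instead, assuming their v1 text-to-speech REST endpoint and a multilingual model. The API key and voice ID are placeholders, and you should check the current ElevenLabs docs before relying on this:

import requests

ELEVEN_API_KEY = "YOUR_API_KEY"    # placeholder: from your ElevenLabs account
ELEVEN_VOICE_ID = "YOUR_VOICE_ID"  # placeholder: pick a voice in their library

def elevenlabs_tts(text: str) -> bytes:
    # POST /v1/text-to-speech/{voice_id} returns audio bytes (mp3 by default)
    resp = requests.post(
        f"https://api.elevenlabs.io/v1/text-to-speech/{ELEVEN_VOICE_ID}",
        headers={"xi-api-key": ELEVEN_API_KEY},
        json={"text": text, "model_id": "eleven_multilingual_v2"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.content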

Once you know which program you will use to generate audio from text, ChatGPT or Claude can create the text in your target language. You can type in the sentences you want to be able to say, or ask these models to suggest sentences for specific topics. I usually provide a few sentences and ask the model to generate more in the same vein.

Next, I edit the Python script below in my IDE, swapping in my custom sentences. The VOICEVOX API address used here is the default for a local installation. You can give this code to the AI you're using and ask it to edit the sentences for you, or to give you the sentences formatted correctly to paste into the code. You can also ask an AI to adapt the code for Mac or Linux if you are not using Windows.
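For reference, the script stores sentences in a LINES list of (Japanese, English) tuples, so it helps to ask the model to return its output in exactly that shape. The two pairs below are just illustrative examples:

# Entries to paste into the LINES list (illustrative examples):
    ("γ‚³γƒΌγƒ’γƒΌγ‚’γŠι‘˜γ„γ—γΎγ™γ€‚", "Coffee, please."),
    ("ι§…γ―γ©γ“γ§γ™γ‹οΌŸ", "Where is the station?"),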

The script contains some options you can change, like voice selection, output format, and whether to include separator beeps. It performs some basic audio normalization and includes example beginner sentences that you can swap out.
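Before settling on JP_SPEAKER_ID and EN_VOICE_SUBSTR, you can list what's available. This is a small sketch, assuming VOICEVOX's standard /speakers endpoint and pyttsx3's voices property:

import requests
import pyttsx3

# VOICEVOX: each speaker offers named styles; the style "id" is the number
# that JP_SPEAKER_ID refers to.
for spk in requests.get("http://127.0.0.1:50021/speakers", timeout=5).json():
    for style in spk["styles"]:
        print(style["id"], spk["name"], style["name"])

# Windows SAPI voices visible to pyttsx3; any substring of a name
# (like "Zira") works for EN_VOICE_SUBSTR.
for v in pyttsx3.init().getProperty("voices"):
    print(v.name)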

If you don't have Python, you will need to install it to use this script; running python --version in a terminal will tell you whether it's already there. If this is all new to you, ask an AI to walk you through it. This is an area where AIs are very well trained.

# Windows-friendly: VOICEVOX (JP) + pyttsx3/SAPI (EN) + pydub normalize & combine

import os
import io
import time
import tempfile
import requests
from pydub import AudioSegment, effects, generators
import pyttsx3
from datetime import datetime


# =========================
# CONFIG
# =========================
VOICEVOX_HOST = "http://127.0.0.1:50021"
JP_SPEAKER_ID = 2             # Try different IDs (1, 2, 3...). Change to taste.
JP_SPEED_SCALE = 1.0          # 0.5 ~ 2.0 (VOICEVOX audio_query param)
JP_PITCH_SCALE = 0.0          # -0.15 ~ 0.15 typical
JP_INTONATION_SCALE = 1.0     # 0 ~ 2
JP_VOLUME_SCALE = 1.0         # 0 ~ 2
JP_PRE_PHONEME_LENGTH = 0.1
JP_POST_PHONEME_LENGTH = 0.1

EN_RATE_WPM = 165             # English speaking rate (SAPI via pyttsx3)
EN_VOLUME = 1.0               # 0.0 ~ 1.0
EN_VOICE_SUBSTR = "Zira"        # which Windows voice to pick (substring match, e.g. "Zira", "en-US", "Guy")

TARGET_SAMPLE_RATE = 24000    # Final sample rate for consistency (VOICEVOX default is often 24000)
PAD_MS = 400                  # silence between segments
BEEP_MS = 0                   # separator beep length (set to 0 to disable beep)
BEEP_HZ = 660

EXPORT_FORMAT = "mp3"         # "wav" or "mp3"
OUT_DIR = "output"
STAMP = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
OUT_FILE = f"jp_en_combo_{STAMP}.{EXPORT_FORMAT}"


# =========================
# Your bilingual lines
# =========================
# Each tuple: (Japanese text, English text)
LINES = [
    ("γŠε…ƒζ°—γ§γ™γ‹οΌŸ", "How are you?"),
    ("元気です。あγͺたは?", "I am well. And yourself?"),
    ("ε…ƒζ°—γ§γ™γ€‚γ‚γ‚ŠγŒγ¨γ†γ€‚", "I am well, thank you."),
    ("はい、もけろん。", "Yes, certainly."),
    ("γγ‚Œγ―ι’η™½γ„γ§γ™γ­γ€‚", "That is interesting."),
    ("ζœ¬ε½“γ§γ™γ‹οΌŸ", "Really?"),
    ("硢対に。", "Definitely."),
    ("γ‚ˆγεˆ†γ‹γ‚ŠγΎγ›γ‚“γ€‚", "I'm not sure."),
    ("γ™γΏγΎγ›γ‚“γ€γ‚‚γ†ε°‘γ—γ‚†γ£γγ‚Šθ©±γ—γ¦γ‚‚γ‚‰γˆγΎγ™γ‹οΌŸ", "Sorry, can you please speak more slowly?"),
    ("わあ!すごいですね。", "Wow! That's great."),

    # --- Everyday reactions ---
    ("γͺるほど。", "I see."),
    ("εˆ†γ‹γ‚ŠγΎγ—γŸγ€‚", "I understand."),
    ("εˆ†γ‹γ‚ŠγΎγ›γ‚“γ€‚", "I don't understand."),
    ("そうです。", "That's right."),
    ("γγ†γ˜γ‚ƒγͺいです。", "That's not right."),
    ("私もそう思います。", "I think so too."),
    ("γŸγΆγ‚“γ€‚", "Maybe."),

    # --- Polite interaction ---
    ("すみません。", "Excuse me."),
    ("γ‚γ‚ŠγŒγ¨γ†γ”γ–γ„γΎγ™γ€‚", "Thank you very much."),
    ("γ©γ†γ„γŸγ—γΎγ—γ¦γ€‚", "You're welcome."),
    ("γŠι‘˜γ„γ—γΎγ™γ€‚", "Please."),
    ("ε€§δΈˆε€«γ§γ™γ€‚", "It's okay."),
    ("γ―γ˜γ‚γΎγ—γ¦γ€‚", "Nice to meet you."),
    ("γ•γ‚ˆγ†γͺら。", "Goodbye."),
    ("γΎγŸγ­γ€‚", "See you later."),

    # --- Feelings & impressions ---
    ("ζ₯½γ—いです。", "That's fun."),
    ("かわいいですね。", "That's cute."),
    ("γγ‚Œγ„γ§γ™γ­γ€‚", "That's beautiful."),
    ("すごいですね!", "That's amazing!"),
    ("ε₯½γγ§γ™γ€‚", "I like it."),
    ("ε₯½γγ˜γ‚ƒγͺいです。", "I don't like it."),
    ("η–²γ‚ŒγΎγ—γŸγ€‚", "I'm tired."),
    ("おγͺγ‹γŒγ™γγΎγ—γŸγ€‚", "I'm hungry."),
    ("おγͺγ‹γŒγ„γ£γ±γ„γ§γ™γ€‚", "I'm full."),

    # --- Practical daily use ---
    ("γƒˆγ‚€γƒ¬γ―γ©γ“γ§γ™γ‹οΌŸ", "Where is the bathroom?"),
    ("γ„γγ‚‰γ§γ™γ‹οΌŸ", "How much is it?"),
    ("δ½•ζ™‚γ§γ™γ‹οΌŸ", "What time is it?"),
    ("ε°‘γ€…γŠεΎ…γ‘γγ γ•γ„γ€‚", "Please wait a moment."),
    ("εˆ†γ‹γ‚ŠγΎγ›γ‚“γ€‚", "I don't know."),
    ("ζ‰‹δΌγ£γ¦γ‚‚γ‚‰γˆγΎγ™γ‹οΌŸ", "Could you help me?"),
    ("ι“γ«θΏ·γ„γΎγ—γŸγ€‚", "I'm lost."),
]

# =========================
# Helpers
# =========================

def _voicevox_tts(text: str, speaker: int) -> AudioSegment:
    """Synthesize Japanese via VOICEVOX engine -> AudioSegment (mono, TARGET_SAMPLE_RATE)."""
    # 1) audio_query: ask the engine to build a synthesis query from the text.
    # VOICEVOX expects text and speaker as query-string parameters here,
    # not a JSON body.
    aq = requests.post(
        f"{VOICEVOX_HOST}/audio_query",
        params={"text": text, "speaker": speaker},
        timeout=10
    )
    aq.raise_for_status()
    query = aq.json()

    # override query fields with our config if present
    query["speedScale"] = JP_SPEED_SCALE
    query["pitchScale"] = JP_PITCH_SCALE
    query["intonationScale"] = JP_INTONATION_SCALE
    query["volumeScale"] = JP_VOLUME_SCALE
    query["prePhonemeLength"] = JP_PRE_PHONEME_LENGTH
    query["postPhonemeLength"] = JP_POST_PHONEME_LENGTH

    # 2) synthesis
    syn = requests.post(
        f"{VOICEVOX_HOST}/synthesis",
        params={"speaker": speaker},
        json=query,
        timeout=30
    )
    syn.raise_for_status()
    wav_bytes = syn.content

    seg = AudioSegment.from_file(io.BytesIO(wav_bytes), format="wav")
    seg = seg.set_channels(1).set_frame_rate(TARGET_SAMPLE_RATE)
    return seg

def _pyttsx3_to_wav(text: str, rate_wpm: int, volume: float, voice_substr: str) -> AudioSegment:
    """Synthesize English via Windows SAPI using pyttsx3 -> AudioSegment."""
    engine = pyttsx3.init()
    # voice
    chosen_voice_id = None
    for v in engine.getProperty("voices"):
        # Choose the first voice that contains the substring (e.g., 'en', 'Zira', 'en-US')
        if voice_substr.lower() in (v.name + " " + v.id).lower():
            chosen_voice_id = v.id
            break
    if chosen_voice_id:
        engine.setProperty("voice", chosen_voice_id)

    engine.setProperty("rate", rate_wpm)
    engine.setProperty("volume", volume)

    with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as tf:
        tmp_path = tf.name

    # Synthesize to file
    engine.save_to_file(text, tmp_path)
    engine.runAndWait()
    engine.stop()

    # Load and resample
    seg = AudioSegment.from_file(tmp_path, format="wav")
    seg = seg.set_channels(1).set_frame_rate(TARGET_SAMPLE_RATE)

    # Clean up temp
    try:
        os.remove(tmp_path)
    except OSError:
        pass

    return seg

def _normalize_peak(seg: AudioSegment, target_dbfs=-1.0) -> AudioSegment:
    """Peak normalize so that max peak ~ target_dbfs."""
    change = target_dbfs - seg.max_dBFS
    return seg.apply_gain(change)

def _separator_beep(ms=BEEP_MS, freq=BEEP_HZ) -> AudioSegment:
    if ms <= 0:
        return AudioSegment.silent(duration=0)
    tone = generators.Sine(freq).to_audio_segment(duration=ms)
    tone = tone.set_frame_rate(TARGET_SAMPLE_RATE).set_channels(1)
    # Gentle -6 dB so it’s not harsh
    tone = tone.apply_gain(-6.0)
    return tone

# =========================
# Build the combined track
# =========================

def main():
    os.makedirs(OUT_DIR, exist_ok=True)

    silence = AudioSegment.silent(duration=PAD_MS)
    beep = _separator_beep()  # zero-length when BEEP_MS == 0

    # Lead-in (keep OUTSIDE the loop)
    combined = AudioSegment.silent(duration=400)

    # --- Practice pause lengths (ms) ---
    PRACTICE_PAUSE_MS_1 = 2500   # pause after first JP (for you to repeat)
    PRACTICE_PAUSE_MS_2 = 2500   # pause after second JP (repeat again)
    practice_pause_1 = AudioSegment.silent(duration=PRACTICE_PAUSE_MS_1)
    practice_pause_2 = AudioSegment.silent(duration=PRACTICE_PAUSE_MS_2)

    print(f"Synthesizing {len(LINES)} JP↔EN pairs... Make sure VOICEVOX is running.")

    for i, (jp_text, en_text) in enumerate(LINES, start=1):
        print(f"[{i}] JP: {jp_text}")
        jp_seg = _voicevox_tts(jp_text, JP_SPEAKER_ID)
        jp_seg = _normalize_peak(jp_seg, -2.0)

        print(f"[{i}] EN: {en_text}")
        en_seg = _pyttsx3_to_wav(en_text, EN_RATE_WPM, EN_VOLUME, EN_VOICE_SUBSTR)
        en_seg = _normalize_peak(en_seg, -2.0)

        # JP β†’ pause β†’ EN β†’ JP (repeat) β†’ pause β†’ optional beep
        # Reuse the same jp_seg for the repeat (saves time and keeps prosody identical)
        pair_block = (
            jp_seg
            + practice_pause_1
            + silence
            + en_seg
            + silence
            + jp_seg            # repeat Japanese
            + practice_pause_2
            + silence
            + beep              # separator beep (disabled when BEEP_MS == 0)
        )

        combined += pair_block
        print(f"    pair {i} length: {len(pair_block)/1000:.2f}s | total: {len(combined)/1000:.2f}s")
        time.sleep(0.05)  # tiny yield to avoid TTS hiccups

    # Final normalize & export
    combined = effects.normalize(combined)

    out_path = os.path.join(OUT_DIR, OUT_FILE)
    if EXPORT_FORMAT.lower() == "mp3":
        combined.export(out_path, format="mp3", bitrate="192k")
    else:
        combined.export(out_path, format="wav")

    print(f"Done! Exported: {out_path}")


if __name__ == "__main__":
    main()

If you decide that you'd prefer different pause lengths or a different number of repetitions, you can edit the code yourself or ask an AI to do it.
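For example, here is one sketch of how the pair_block inside the loop could be reworked so the repetition count becomes a setting. JP_REPEATS is a hypothetical new option, and note that this variant plays the Japanese line several times before the English, rather than sandwiching the English between two repetitions:

# Hypothetical tweak: make the number of Japanese repetitions configurable
JP_REPEATS = 3

pair_block = AudioSegment.silent(duration=0)
for _ in range(JP_REPEATS):
    pair_block += jp_seg + practice_pause_1 + silence
pair_block += en_seg + silence + beep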

There are a few Python packages used here that you may need to install from the command line/terminal:

requests: connects to the local VOICEVOX engine API (http://127.0.0.1:50021). Install with pip install requests.
pydub: handles audio combining, silence padding, normalization, and export. Install with pip install pydub.
pyttsx3: uses Windows' built-in SAPI voices for English text-to-speech. Install with pip install pyttsx3.
ffmpeg: an external tool that pydub needs to read and export MP3/WAV files. Installed separately (see below).

I also had to install audioop-lts (pip install audioop-lts). Recent versions of Python no longer ship the audioop module that pydub depends on, and this package restores it.

You can use this one-liner to install all the Python packages at once: pip install requests pydub pyttsx3 audioop-lts

This is how I installed ffmpeg: winget install Gyan.FFmpeg
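If you want to confirm that pydub can see ffmpeg before running the full script, a quick smoke test is to export a short silent clip (the filename test.mp3 is arbitrary). If this runs without an error, MP3 export will work:

from pydub import AudioSegment

# Exporting to mp3 forces pydub to invoke ffmpeg, so this fails fast
# if ffmpeg isn't installed or isn't on your PATH.
AudioSegment.silent(duration=500).export("test.mp3", format="mp3")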

Once you have the script, your sentences, and all the required tools ready to go, run the script in your terminal (for example, python make_audio.py, substituting whatever filename you saved it under). Voila, custom audio!

To make new audio, all you have to do is change the sentences in the script. I hope this is helpful to you. Enjoy!