Experiments in Life

How to Create Custom Language Learning Audio with AI

There are a lot of great language learning audio resources out there, and I use several of them for Japanese.

Sometimes, however, I want custom audio. There are topics I want to be able to talk about and specific sentences I want to be able to say. Having an audio file to drill me on these is helpful.

This is my method of creating custom audio for Japanese:

First, download VOICEVOX. This is a free AI text-to-speech engine that runs on-device and produces good Japanese audio from Japanese text. I use the CPU-only version because I create these files on my laptop, which does not have a powerful GPU. All of the program menus are in Japanese, so I use my phone's camera translation to navigate them.
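Once it's installed, start the app and leave it running; the script below talks to the engine over a local HTTP API. Here is a quick sanity check that the engine is reachable, a minimal sketch assuming the default port of 50021:

import requests

# The VOICEVOX engine serves a local HTTP API while the app is running.
# This prints the engine version if everything is wired up.
print(requests.get("http://127.0.0.1:50021/version", timeout=5).text)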

If you are learning a different language, you will have to search around for a different program. ElevenLabs has great voices for a lot of languages. You can ask an AI to adapt the code below to use it.
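For instance, here is a rough sketch of what a synthesis function might look like with ElevenLabs instead, assuming their v1 text-to-speech REST endpoint and a multilingual model. The API key and voice ID are placeholders, and you should check the current ElevenLabs docs before relying on this:

import requests

ELEVEN_API_KEY = "YOUR_API_KEY"    # placeholder: from your ElevenLabs account
ELEVEN_VOICE_ID = "YOUR_VOICE_ID"  # placeholder: pick a voice in their library

def elevenlabs_tts(text: str) -> bytes:
    # POST /v1/text-to-speech/{voice_id} returns audio bytes (mp3 by default)
    resp = requests.post(
        f"https://api.elevenlabs.io/v1/text-to-speech/{ELEVEN_VOICE_ID}",
        headers={"xi-api-key": ELEVEN_API_KEY},
        json={"text": text, "model_id": "eleven_multilingual_v2"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.content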

Once you know which program you will use to generate audio from text, ChatGPT or Claude can create the text in your target language. You can type in the sentences you want to be able to say, or ask these models to suggest sentences for specific topics. I usually provide a few sentences and ask the model to generate more in the same vein.

Next, I edit the Python script below in my IDE, swapping in my custom sentences. The VOICEVOX API address used here is the default for a local installation. You can give this code to the AI you're using and ask it to edit the sentences for you, or to give you the sentences formatted correctly to paste into the code. You can also ask an AI to adapt the code for Mac or Linux if you are not using Windows.
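For reference, the script stores sentences in a LINES list of (Japanese, English) tuples, so it helps to ask the model to return its output in exactly that shape. The two pairs below are just illustrative examples:

# Entries to paste into the LINES list (illustrative examples):
    ("γ‚³γƒΌγƒ’γƒΌγ‚’γŠι‘˜γ„γ—γΎγ™γ€‚", "Coffee, please."),
    ("ι§…γ―γ©γ“γ§γ™γ‹οΌŸ", "Where is the station?"),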

The script contains some options you can change, like voice selection, output format, and whether to include separator beeps. It performs some basic audio normalization and includes example beginner sentences that you can swap out.
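Before settling on JP_SPEAKER_ID and EN_VOICE_SUBSTR, you can list what's available. This is a small sketch, assuming VOICEVOX's standard /speakers endpoint and pyttsx3's voices property:

import requests
import pyttsx3

# VOICEVOX: each speaker offers named styles; the style "id" is the number
# that JP_SPEAKER_ID refers to.
for spk in requests.get("http://127.0.0.1:50021/speakers", timeout=5).json():
    for style in spk["styles"]:
        print(style["id"], spk["name"], style["name"])

# Windows SAPI voices visible to pyttsx3; any substring of a name
# (like "Zira") works for EN_VOICE_SUBSTR.
for v in pyttsx3.init().getProperty("voices"):
    print(v.name)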

If you don't have Python, you will need to install it to use this script; running python --version in a terminal will tell you whether it's already there. If this is all new to you, ask an AI to walk you through it. This is an area where AIs are very well trained.

# Windows-friendly: VOICEVOX (JP) + pyttsx3/SAPI (EN) + pydub normalize & combine

import os
import io
import time
import tempfile
import requests
from pydub import AudioSegment, effects, generators
import pyttsx3
from datetime import datetime


# =========================
# CONFIG
# =========================
VOICEVOX_HOST = "http://127.0.0.1:50021"
JP_SPEAKER_ID = 2             # Try different IDs (1, 2, 3...). Change to taste.
JP_SPEED_SCALE = 1.0          # 0.5 ~ 2.0 (VOICEVOX audio_query param)
JP_PITCH_SCALE = 0.0          # -0.15 ~ 0.15 typical
JP_INTONATION_SCALE = 1.0     # 0 ~ 2
JP_VOLUME_SCALE = 1.0         # 0 ~ 2
JP_PRE_PHONEME_LENGTH = 0.1
JP_POST_PHONEME_LENGTH = 0.1

EN_RATE_WPM = 165             # English speaking rate (SAPI via pyttsx3)
EN_VOLUME = 1.0               # 0.0 ~ 1.0
EN_VOICE_SUBSTR = "Zira"        # which Windows voice to pick (substring match, e.g. "Zira", "en-US", "Guy")

TARGET_SAMPLE_RATE = 24000    # Final sample rate for consistency (VOICEVOX default is often 24000)
PAD_MS = 400                  # silence between segments
BEEP_MS = 0                   # separator beep length (set to 0 to disable beep)
BEEP_HZ = 660

EXPORT_FORMAT = "mp3"         # "wav" or "mp3"
OUT_DIR = "output"
STAMP = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
OUT_FILE = f"jp_en_combo_{STAMP}.{EXPORT_FORMAT}"


# =========================
# Your bilingual lines
# =========================
# Each tuple: (Japanese text, English text)
LINES = [
    ("γŠε…ƒζ°—γ§γ™γ‹οΌŸ", "How are you?"),
    ("元気です。あγͺたは?", "I am well. And yourself?"),
    ("ε…ƒζ°—γ§γ™γ€‚γ‚γ‚ŠγŒγ¨γ†γ€‚", "I am well, thank you."),
    ("はい、もけろん。", "Yes, certainly."),
    ("γγ‚Œγ―ι’η™½γ„γ§γ™γ­γ€‚", "That is interesting."),
    ("ζœ¬ε½“γ§γ™γ‹οΌŸ", "Really?"),
    ("硢対に。", "Definitely."),
    ("γ‚ˆγεˆ†γ‹γ‚ŠγΎγ›γ‚“γ€‚", "I'm not sure."),
    ("γ™γΏγΎγ›γ‚“γ€γ‚‚γ†ε°‘γ—γ‚†γ£γγ‚Šθ©±γ—γ¦γ‚‚γ‚‰γˆγΎγ™γ‹οΌŸ", "Sorry, can you please speak more slowly?"),
    ("わあ!すごいですね。", "Wow! That's great."),

    # --- Everyday reactions ---
    ("γͺるほど。", "I see."),
    ("εˆ†γ‹γ‚ŠγΎγ—γŸγ€‚", "I understand."),
    ("εˆ†γ‹γ‚ŠγΎγ›γ‚“γ€‚", "I don't understand."),
    ("そうです。", "That's right."),
    ("γγ†γ˜γ‚ƒγͺいです。", "That's not right."),
    ("私もそう思います。", "I think so too."),
    ("γŸγΆγ‚“γ€‚", "Maybe."),

    # --- Polite interaction ---
    ("すみません。", "Excuse me."),
    ("γ‚γ‚ŠγŒγ¨γ†γ”γ–γ„γΎγ™γ€‚", "Thank you very much."),
    ("γ©γ†γ„γŸγ—γΎγ—γ¦γ€‚", "You're welcome."),
    ("γŠι‘˜γ„γ—γΎγ™γ€‚", "Please."),
    ("ε€§δΈˆε€«γ§γ™γ€‚", "It's okay."),
    ("γ―γ˜γ‚γΎγ—γ¦γ€‚", "Nice to meet you."),
    ("γ•γ‚ˆγ†γͺら。", "Goodbye."),
    ("γΎγŸγ­γ€‚", "See you later."),

    # --- Feelings & impressions ---
    ("ζ₯½γ—いです。", "That's fun."),
    ("かわいいですね。", "That's cute."),
    ("γγ‚Œγ„γ§γ™γ­γ€‚", "That's beautiful."),
    ("すごいですね!", "That's amazing!"),
    ("ε₯½γγ§γ™γ€‚", "I like it."),
    ("ε₯½γγ˜γ‚ƒγͺいです。", "I don't like it."),
    ("η–²γ‚ŒγΎγ—γŸγ€‚", "I'm tired."),
    ("おγͺγ‹γŒγ™γγΎγ—γŸγ€‚", "I'm hungry."),
    ("おγͺγ‹γŒγ„γ£γ±γ„γ§γ™γ€‚", "I'm full."),

    # --- Practical daily use ---
    ("γƒˆγ‚€γƒ¬γ―γ©γ“γ§γ™γ‹οΌŸ", "Where is the bathroom?"),
    ("γ„γγ‚‰γ§γ™γ‹οΌŸ", "How much is it?"),
    ("δ½•ζ™‚γ§γ™γ‹οΌŸ", "What time is it?"),
    ("ε°‘γ€…γŠεΎ…γ‘γγ γ•γ„γ€‚", "Please wait a moment."),
    ("εˆ†γ‹γ‚ŠγΎγ›γ‚“γ€‚", "I don't know."),
    ("ζ‰‹δΌγ£γ¦γ‚‚γ‚‰γˆγΎγ™γ‹οΌŸ", "Could you help me?"),
    ("ι“γ«θΏ·γ„γΎγ—γŸγ€‚", "I'm lost."),
]

# =========================
# Helpers
# =========================

def _voicevox_tts(text: str, speaker: int) -> AudioSegment:
    """Synthesize Japanese via VOICEVOX engine -> AudioSegment (mono, TARGET_SAMPLE_RATE)."""
    # 1) audio_query: ask the engine to build a synthesis query from the text.
    # VOICEVOX expects text and speaker as query-string parameters here,
    # not a JSON body.
    aq = requests.post(
        f"{VOICEVOX_HOST}/audio_query",
        params={"text": text, "speaker": speaker},
        timeout=10
    )
    aq.raise_for_status()
    query = aq.json()

    # override query fields with our config if present
    query["speedScale"] = JP_SPEED_SCALE
    query["pitchScale"] = JP_PITCH_SCALE
    query["intonationScale"] = JP_INTONATION_SCALE
    query["volumeScale"] = JP_VOLUME_SCALE
    query["prePhonemeLength"] = JP_PRE_PHONEME_LENGTH
    query["postPhonemeLength"] = JP_POST_PHONEME_LENGTH

    # 2) synthesis
    syn = requests.post(
        f"{VOICEVOX_HOST}/synthesis",
        params={"speaker": speaker},
        json=query,
        timeout=30
    )
    syn.raise_for_status()
    wav_bytes = syn.content

    seg = AudioSegment.from_file(io.BytesIO(wav_bytes), format="wav")
    seg = seg.set_channels(1).set_frame_rate(TARGET_SAMPLE_RATE)
    return seg

def _pyttsx3_to_wav(text: str, rate_wpm: int, volume: float, voice_substr: str) -> AudioSegment:
    """Synthesize English via Windows SAPI using pyttsx3 -> AudioSegment."""
    engine = pyttsx3.init()
    # voice
    chosen_voice_id = None
    for v in engine.getProperty("voices"):
        # Choose the first voice that contains the substring (e.g., 'en', 'Zira', 'en-US')
        if voice_substr.lower() in (v.name + " " + v.id).lower():
            chosen_voice_id = v.id
            break
    if chosen_voice_id:
        engine.setProperty("voice", chosen_voice_id)

    engine.setProperty("rate", rate_wpm)
    engine.setProperty("volume", volume)

    with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as tf:
        tmp_path = tf.name

    # Synthesize to file
    engine.save_to_file(text, tmp_path)
    engine.runAndWait()
    engine.stop()

    # Load and resample
    seg = AudioSegment.from_file(tmp_path, format="wav")
    seg = seg.set_channels(1).set_frame_rate(TARGET_SAMPLE_RATE)

    # Clean up temp
    try:
        os.remove(tmp_path)
    except OSError:
        pass

    return seg

def _normalize_peak(seg: AudioSegment, target_dbfs=-1.0) -> AudioSegment:
    """Peak normalize so that max peak ~ target_dbfs."""
    change = target_dbfs - seg.max_dBFS
    return seg.apply_gain(change)

def _separator_beep(ms=BEEP_MS, freq=BEEP_HZ) -> AudioSegment:
    if ms <= 0:
        return AudioSegment.silent(duration=0)
    tone = generators.Sine(freq).to_audio_segment(duration=ms)
    tone = tone.set_frame_rate(TARGET_SAMPLE_RATE).set_channels(1)
    # Gentle -6 dB so it’s not harsh
    tone = tone.apply_gain(-6.0)
    return tone

# =========================
# Build the combined track
# =========================

def main():
    os.makedirs(OUT_DIR, exist_ok=True)

    silence = AudioSegment.silent(duration=PAD_MS)
    beep = _separator_beep()  # zero-length when BEEP_MS == 0

    # Lead-in (keep OUTSIDE the loop)
    combined = AudioSegment.silent(duration=400)

    # --- Practice pause lengths (ms) ---
    PRACTICE_PAUSE_MS_1 = 2500   # pause after first JP (for you to repeat)
    PRACTICE_PAUSE_MS_2 = 2500   # pause after second JP (repeat again)
    practice_pause_1 = AudioSegment.silent(duration=PRACTICE_PAUSE_MS_1)
    practice_pause_2 = AudioSegment.silent(duration=PRACTICE_PAUSE_MS_2)

    print(f"Synthesizing {len(LINES)} JP↔EN pairs... Make sure VOICEVOX is running.")

    for i, (jp_text, en_text) in enumerate(LINES, start=1):
        print(f"[{i}] JP: {jp_text}")
        jp_seg = _voicevox_tts(jp_text, JP_SPEAKER_ID)
        jp_seg = _normalize_peak(jp_seg, -2.0)

        print(f"[{i}] EN: {en_text}")
        en_seg = _pyttsx3_to_wav(en_text, EN_RATE_WPM, EN_VOLUME, EN_VOICE_SUBSTR)
        en_seg = _normalize_peak(en_seg, -2.0)

        # JP β†’ pause β†’ EN β†’ JP (repeat) β†’ pause β†’ optional beep
        # Reuse the same jp_seg for the repeat (saves time and keeps prosody identical)
        pair_block = (
            jp_seg
            + practice_pause_1
            + silence
            + en_seg
            + silence
            + jp_seg            # repeat Japanese
            + practice_pause_2
            + silence
            + beep              # separator beep (disabled when BEEP_MS == 0)
        )

        combined += pair_block
        print(f"    pair {i} length: {len(pair_block)/1000:.2f}s | total: {len(combined)/1000:.2f}s")
        time.sleep(0.05)  # tiny yield to avoid TTS hiccups

    # Final normalize & export
    combined = effects.normalize(combined)

    out_path = os.path.join(OUT_DIR, OUT_FILE)
    if EXPORT_FORMAT.lower() == "mp3":
        combined.export(out_path, format="mp3", bitrate="192k")
    else:
        combined.export(out_path, format="wav")

    print(f"Done! Exported: {out_path}")


if __name__ == "__main__":
    main()

If you decide that you'd prefer different pause lengths or a different number of repetitions, you can edit the code yourself or ask an AI to do it.
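For example, here is one sketch of how the pair_block inside the loop could be reworked so the repetition count becomes a setting. JP_REPEATS is a hypothetical new option, and note that this variant plays the Japanese line several times before the English, rather than sandwiching the English between two repetitions:

# Hypothetical tweak: make the number of Japanese repetitions configurable
JP_REPEATS = 3

pair_block = AudioSegment.silent(duration=0)
for _ in range(JP_REPEATS):
    pair_block += jp_seg + practice_pause_1 + silence
pair_block += en_seg + silence + beep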

There are a few Python packages used here that you may need to install from the command line/terminal:

requests: connects to the local VOICEVOX engine API (http://127.0.0.1:50021). Install with pip install requests.
pydub: handles audio combining, silence padding, normalization, and export. Install with pip install pydub.
pyttsx3: uses Windows' built-in SAPI voices for English text-to-speech. Install with pip install pyttsx3.
ffmpeg: an external tool that pydub needs to read and export MP3/WAV files. Installed separately (see below).

I also had to install audioop-lts (pip install audioop-lts). Recent versions of Python no longer ship the audioop module that pydub depends on, and this package restores it.

You can use this one-liner to install all the Python packages at once: pip install requests pydub pyttsx3 audioop-lts

This is how I installed ffmpeg: winget install Gyan.FFmpeg
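If you want to confirm that pydub can see ffmpeg before running the full script, a quick smoke test is to export a short silent clip (the filename test.mp3 is arbitrary). If this runs without an error, MP3 export will work:

from pydub import AudioSegment

# Exporting to mp3 forces pydub to invoke ffmpeg, so this fails fast
# if ffmpeg isn't installed or isn't on your PATH.
AudioSegment.silent(duration=500).export("test.mp3", format="mp3")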

Once you have the script, your sentences, and all the required tools ready to go, run the script in your terminal (for example, python make_audio.py, substituting whatever filename you saved it under). Voila, custom audio!

To make new audio, all you have to do is change the sentences in the script. I hope this is helpful to you. Enjoy!