Offline
Hello there!
I've been building a small personal AI agent called nevinho. It lives in my Discord DMs and helps me with everyday stuff. Bash commands, web searches, code questions, the usual.
After using it for a while, the friction became obvious. Typing into Discord on my phone while walking or cooking is awful. I wanted to just talk to it.
Discord already has voice messages built in. You hold a button, talk, and it sends an audio attachment. So the job was simple on paper. Receive the audio, turn it into text, feed it to the agent.
This post is the story of how I wired that up without paying for an API.
Why local instead of OpenAI's Whisper API
My first instinct was to call the OpenAI Whisper API. It works, it's accurate, it's cheap.
But this is a personal bot. Every dependency I add is one more thing that can break, expire, or cost money I forgot about. I already had API keys for the LLM. I didn't want another bill, another secret to rotate, another vendor lock.
There's also a nice principle behind nevinho: the harness should do as much as possible on my own machine. Voice felt like the perfect place to apply that.
So I went looking for something that runs locally.
Picking whisper.cpp
whisper.cpp is Georgi Gerganov's port of OpenAI's Whisper model to plain C++. It runs on CPU, no GPU needed, and the smallest model (ggml-tiny.bin) is only about 75MB.

For my use case (short voice messages, mostly English, occasional French) tiny is more than enough. Transcription on a normal laptop takes 1 to 3 seconds.
Two binaries, one model, no network calls after install. Perfect.
The pipeline
When a user sends a voice message in Discord, the bot needs to do four things:
- Detect that the attachment is audio
- Download the file
- Convert it to a format whisper.cpp accepts
- Run whisper and get the text back
The format step matters. Discord voice messages come as .ogg files. whisper.cpp wants 16kHz mono WAV. So I lean on ffmpeg to do the conversion.

Here's the core of the transcribe function:
```go
func Transcribe(whisperDir string, audioData []byte, filename string) (string, error) {
	binary := filepath.Join(whisperDir, "whisper-cli")
	model := filepath.Join(whisperDir, "ggml-tiny.bin")

	tmpDir, err := os.MkdirTemp("", "nevinho-voice-*")
	if err != nil {
		return "", fmt.Errorf("create temp dir: %w", err)
	}
	defer os.RemoveAll(tmpDir)

	audioPath := filepath.Join(tmpDir, filename)
	if err := os.WriteFile(audioPath, audioData, 0644); err != nil {
		return "", fmt.Errorf("write audio: %w", err)
	}

	// Convert to 16kHz mono WAV for whisper.cpp
	wavPath := filepath.Join(tmpDir, "audio.wav")
	ffmpeg := exec.Command("ffmpeg",
		"-i", audioPath,
		"-ar", "16000", "-ac", "1",
		"-c:a", "pcm_s16le",
		wavPath, "-y")
	if err := ffmpeg.Run(); err != nil {
		return "", fmt.Errorf("ffmpeg conversion: %w", err)
	}

	// Run whisper with a 30s timeout
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	cmd := exec.CommandContext(ctx, binary,
		"-m", model, "-f", wavPath,
		"--no-timestamps", "-t", "4")
	output, err := cmd.CombinedOutput()
	if err != nil {
		return "", fmt.Errorf("whisper: %w", err)
	}
	return cleanWhisperOutput(string(output)), nil
}
```
That's the whole thing. Two exec calls and a temp directory. No SDK, no streaming, no chunking.

Cleaning the output
whisper.cpp prints log lines mixed in with the transcription. Stuff like whisper_init_from_file_with_params: ... and system_info: .... Useful for debugging, noise for me.

So I strip them by prefix:
```go
func cleanWhisperOutput(output string) string {
	var lines []string
	for _, line := range strings.Split(output, "\n") {
		line = strings.TrimSpace(line)
		if line == "" {
			continue
		}
		if strings.HasPrefix(line, "whisper_") ||
			strings.HasPrefix(line, "main:") ||
			strings.HasPrefix(line, "system_info:") ||
			strings.HasPrefix(line, "output_") {
			continue
		}
		lines = append(lines, line)
	}
	return strings.Join(lines, " ")
}
```
Not glamorous, but it works.
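To see what the filter actually does, here's a standalone demo. The sample log lines are invented for illustration, and the function body repeats the one above so the demo runs on its own. One side effect worth noticing: a transcription split across multiple lines gets joined with spaces.

```go
package main

import (
	"fmt"
	"strings"
)

// Same logic as cleanWhisperOutput above, repeated so this demo is standalone.
func cleanWhisperOutput(output string) string {
	var lines []string
	for _, line := range strings.Split(output, "\n") {
		line = strings.TrimSpace(line)
		if line == "" {
			continue
		}
		if strings.HasPrefix(line, "whisper_") ||
			strings.HasPrefix(line, "main:") ||
			strings.HasPrefix(line, "system_info:") ||
			strings.HasPrefix(line, "output_") {
			continue
		}
		lines = append(lines, line)
	}
	return strings.Join(lines, " ")
}

func main() {
	// Invented sample: two log lines, then a transcription split over two lines.
	raw := "whisper_init_from_file_with_params: loading model\n" +
		"system_info: n_threads = 4\n" +
		"Hello there,\n" +
		"can you check the weather?\n"
	fmt.Println(cleanWhisperOutput(raw))
	// → Hello there, can you check the weather?
}
```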
Installing whisper without scaring people
The harder problem was install. Most users won't want to know what whisper.cpp is. They just want voice to work.
So nevinho setup handles the whole thing. The first time you enable voice, it:

- Installs ffmpeg if missing (apt, brew, dnf, apk)
- Tries to download a pre-built whisper-cli binary from the nevinho releases page
- Falls back to building from source if no pre-built binary exists for the platform
- Downloads the 75MB tiny model from Hugging Face
The pre-built binary is the fast path. Most people get voice in under a minute. Building from source needs cmake and a C compiler, takes a few minutes, but works on anything.
```go
func installWhisper(whisperDir, dest string) error {
	err := downloadWhisperBinary(dest)
	if err == nil {
		return nil
	}
	fmt.Println("  No pre-built binary for this platform, building from source...")
	return buildWhisper(whisperDir, dest)
}
```
This was the part I cared about most. A feature is only as good as how easy it is to turn on.
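The downloadWhisperBinary and buildWhisper helpers aren't shown here. As a rough sketch of the download path, here's what it might look like, assuming release assets are named whisper-cli-&lt;os&gt;-&lt;arch&gt; and served through GitHub's latest-release download URL; nevinho's actual asset naming may differ.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"runtime"
)

// assetName builds the release asset name for a platform.
// The naming scheme is an assumption for this sketch, not nevinho's actual one.
func assetName(goos, goarch string) string {
	return fmt.Sprintf("whisper-cli-%s-%s", goos, goarch)
}

// downloadWhisperBinary fetches the pre-built binary for the current
// platform and writes it to dest with the executable bit set.
func downloadWhisperBinary(dest string) error {
	url := "https://github.com/lucasnevespereira/nevinho/releases/latest/download/" +
		assetName(runtime.GOOS, runtime.GOARCH)
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("no pre-built binary: HTTP %d", resp.StatusCode)
	}
	out, err := os.OpenFile(dest, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0755)
	if err != nil {
		return err
	}
	defer out.Close()
	_, err = io.Copy(out, resp.Body)
	return err
}

func main() {
	// Just show which asset this machine would ask for; no network call.
	fmt.Println(assetName(runtime.GOOS, runtime.GOARCH))
}
```

Returning an error on a non-200 status is what lets installWhisper fall through cleanly to the source build.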
Wiring it into Discord
The Discord side is small. When a message comes in with no text but an audio attachment, transcribe it and pretend the user just typed the result.
```go
if text == "" && len(m.Attachments) > 0 {
	for _, att := range m.Attachments {
		if isAudioAttachment(att) {
			transcribed := b.transcribeAttachment(s, m.ChannelID, att)
			if transcribed != "" {
				text = transcribed
				isVoice = true
			}
			break
		}
	}
}
```
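The isAudioAttachment check isn't shown above. A minimal version (my sketch, not necessarily nevinho's exact logic) can key off the content type Discord reports for the attachment, with the file extension as a fallback:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// isAudio reports whether an attachment looks like audio, based on its
// content type and, failing that, its filename extension.
func isAudio(contentType, filename string) bool {
	if strings.HasPrefix(contentType, "audio/") {
		return true
	}
	switch strings.ToLower(filepath.Ext(filename)) {
	case ".ogg", ".mp3", ".wav", ".m4a":
		return true
	}
	return false
}

func main() {
	fmt.Println(isAudio("audio/ogg", "voice-message.ogg")) // true
	fmt.Println(isAudio("image/png", "photo.png"))         // false
}
```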
From here the rest of the agent loop runs unchanged. The voice flow doesn't care that the input came from audio. It's just text now.
One small touch: when the agent replies to a voice message, it replies as a Discord quote-reply to the original audio message. That way the conversation stays visually linked.
What it costs
Zero. No API key, no per-minute charge, no monthly subscription.
The only cost is the 75MB model on disk and a few seconds of CPU per voice message. On a $5/month VPS this is nothing.
What I'd improve
The tiny model is fast but not the most accurate, especially on accents or noisy audio. Swapping in base or small is a one-line change in the setup, just a different model file. I'll probably make it configurable.

The other thing on my list is better multilingual support. The tiny model technically handles many languages, but quality drops fast outside English. I switch between English, French, and Portuguese throughout the day, and tiny struggles with that.
The bigger Whisper models (small, medium) handle that mix much better. The tradeoff is size and CPU time. small is around 460MB and a few times slower than tiny, but still fine on a normal machine. The right answer is probably to let people pick the model during setup based on the languages they actually speak.

Try it
Nevinho is open source. If you want a personal AI bot in your DMs that can listen as well as read, it takes about a minute to install:
```shell
curl -sSL https://raw.githubusercontent.com/lucasnevespereira/nevinho/main/install.sh | bash
nevinho setup
nevinho start
```
Repo here: github.com/lucasnevespereira/nevinho.
Hope this gives you ideas for your own projects. If you build something similar or have feedback, feel free to open an issue or just say hi.