Offline
Hello there!
I've been building a small personal AI agent called nevinho. It lives in my Discord DMs and helps me with everyday stuff. Bash commands, web searches, code questions, the usual.
After using it for a while, the friction became obvious. Typing into Discord on my phone while walking or cooking is awful. I wanted to just talk to it.
Discord already has voice messages built in. You hold a button, talk, and it sends an audio attachment. So the job was simple on paper. Receive the audio, turn it into text, feed it to the agent.
This post is the story of how I wired that up without paying for an API.
Why local instead of OpenAI's Whisper API
My first instinct was to call the OpenAI Whisper API. It works, it's accurate, it's cheap.
But this is a personal bot. Every dependency I add is one more thing that can break, expire, or cost money I forgot about. I already had API keys for the LLM. I didn't want another bill, another secret to rotate, another vendor lock.
There's also a nice principle behind nevinho: the harness should do as much as possible on my own machine. Voice felt like the perfect place to apply that.
So I went looking for something that runs locally.
Picking whisper.cpp
whisper.cpp is Georgi Gerganov's port of OpenAI's Whisper model to plain C++. It runs on CPU, no GPU needed, and the smallest model (ggml-tiny.bin) is only about 75MB.

For my use case (short voice messages, mostly English, occasional French) tiny is more than enough. Transcription on a normal laptop takes 1 to 3 seconds.
Two binaries, one model, no network calls after install. Perfect.
The pipeline
When a user sends a voice message in Discord, the bot needs to do four things:
- Detect that the attachment is audio
- Download the file
- Convert it to a format whisper.cpp accepts
- Run whisper and get the text back
The format step matters. Discord voice messages come as .ogg files. whisper.cpp wants 16kHz mono WAV. So I lean on ffmpeg to do the conversion.

Here's the core of the transcribe function:
```go
func Transcribe(whisperDir string, audioData []byte, filename string) (string, error) {
	binary := filepath.Join(whisperDir, "whisper-cli")
	model := filepath.Join(whisperDir, "ggml-tiny.bin")

	tmpDir, err := os.MkdirTemp("", "nevinho-voice-*")
	if err != nil {
		return "", fmt.Errorf("create temp dir: %w", err)
	}
	defer os.RemoveAll(tmpDir)

	audioPath := filepath.Join(tmpDir, filename)
	if err := os.WriteFile(audioPath, audioData, 0644); err != nil {
		return "", fmt.Errorf("write audio: %w", err)
	}

	// Convert to 16kHz mono WAV for whisper.cpp
	wavPath := filepath.Join(tmpDir, "audio.wav")
	ffmpeg := exec.Command("ffmpeg",
		"-i", audioPath,
		"-ar", "16000", "-ac", "1",
		"-c:a", "pcm_s16le",
		wavPath, "-y")
	if err := ffmpeg.Run(); err != nil {
		return "", fmt.Errorf("ffmpeg conversion: %w", err)
	}

	// Run whisper with a 30s timeout
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	cmd := exec.CommandContext(ctx, binary,
		"-m", model, "-f", wavPath,
		"--no-timestamps", "-t", "4")
	output, err := cmd.CombinedOutput()
	if err != nil {
		return "", fmt.Errorf("whisper: %w", err)
	}
	return cleanWhisperOutput(string(output)), nil
}
```
That's the whole thing. Two exec calls and a temp directory. No SDK, no streaming, no chunking.

Cleaning the output
whisper.cpp prints log lines mixed in with the transcription. Stuff like whisper_init_from_file_with_params: ... and system_info: .... Useful for debugging, noise for me.

So I strip them by prefix:
```go
func cleanWhisperOutput(output string) string {
	var lines []string
	for _, line := range strings.Split(output, "\n") {
		line = strings.TrimSpace(line)
		if line == "" {
			continue
		}
		if strings.HasPrefix(line, "whisper_") ||
			strings.HasPrefix(line, "main:") ||
			strings.HasPrefix(line, "system_info:") ||
			strings.HasPrefix(line, "output_") {
			continue
		}
		lines = append(lines, line)
	}
	return strings.Join(lines, " ")
}
```
Not glamorous, but it works.
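To see what the filter actually does, here's a standalone demo. The sample log lines are invented for illustration, and the function body repeats the one above so the demo runs on its own. One side effect worth noticing: a transcription split across multiple lines gets joined with spaces.

```go
package main

import (
	"fmt"
	"strings"
)

// Same logic as cleanWhisperOutput above, repeated so this demo is standalone.
func cleanWhisperOutput(output string) string {
	var lines []string
	for _, line := range strings.Split(output, "\n") {
		line = strings.TrimSpace(line)
		if line == "" {
			continue
		}
		if strings.HasPrefix(line, "whisper_") ||
			strings.HasPrefix(line, "main:") ||
			strings.HasPrefix(line, "system_info:") ||
			strings.HasPrefix(line, "output_") {
			continue
		}
		lines = append(lines, line)
	}
	return strings.Join(lines, " ")
}

func main() {
	// Invented sample: two log lines, then a transcription split over two lines.
	raw := "whisper_init_from_file_with_params: loading model\n" +
		"system_info: n_threads = 4\n" +
		"Hello there,\n" +
		"can you check the weather?\n"
	fmt.Println(cleanWhisperOutput(raw))
	// → Hello there, can you check the weather?
}
```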
Installing whisper without scaring people
The harder problem was install. Most users won't want to know what whisper.cpp is. They just want voice to work.
So nevinho setup handles the whole thing. The first time you enable voice, it:

- Installs ffmpeg if missing (apt, brew, dnf, apk)
- Tries to download a pre-built whisper-cli binary from the nevinho releases page
- Falls back to building from source if no pre-built binary exists for the platform
- Downloads the 75MB tiny model from Hugging Face
The pre-built binary is the fast path. Most people get voice in under a minute. Building from source needs cmake and a C compiler, takes a few minutes, but works on anything.
```go
func installWhisper(whisperDir, dest string) error {
	err := downloadWhisperBinary(dest)
	if err == nil {
		return nil
	}
	fmt.Println("  No pre-built binary for this platform, building from source...")
	return buildWhisper(whisperDir, dest)
}
```
This was the part I cared about most. A feature is only as good as how easy it is to turn on.
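The downloadWhisperBinary and buildWhisper helpers aren't shown here. As a rough sketch of the download path, here's what it might look like, assuming release assets are named whisper-cli-&lt;os&gt;-&lt;arch&gt; and served through GitHub's latest-release download URL; nevinho's actual asset naming may differ.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"runtime"
)

// assetName builds the release asset name for a platform.
// The naming scheme is an assumption for this sketch, not nevinho's actual one.
func assetName(goos, goarch string) string {
	return fmt.Sprintf("whisper-cli-%s-%s", goos, goarch)
}

// downloadWhisperBinary fetches the pre-built binary for the current
// platform and writes it to dest with the executable bit set.
func downloadWhisperBinary(dest string) error {
	url := "https://github.com/lucasnevespereira/nevinho/releases/latest/download/" +
		assetName(runtime.GOOS, runtime.GOARCH)
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("no pre-built binary: HTTP %d", resp.StatusCode)
	}
	out, err := os.OpenFile(dest, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0755)
	if err != nil {
		return err
	}
	defer out.Close()
	_, err = io.Copy(out, resp.Body)
	return err
}

func main() {
	// Just show which asset this machine would ask for; no network call.
	fmt.Println(assetName(runtime.GOOS, runtime.GOARCH))
}
```

Returning an error on a non-200 status is what lets installWhisper fall through cleanly to the source build.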
Wiring it into Discord
The Discord side is small. When a message comes in with no text but an audio attachment, transcribe it and pretend the user just typed the result.
```go
if text == "" && len(m.Attachments) > 0 {
	for _, att := range m.Attachments {
		if isAudioAttachment(att) {
			transcribed := b.transcribeAttachment(s, m.ChannelID, att)
			if transcribed != "" {
				text = transcribed
				isVoice = true
			}
			break
		}
	}
}
```
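The isAudioAttachment check isn't shown above. A minimal version (my sketch, not necessarily nevinho's exact logic) can key off the content type Discord reports for the attachment, with the file extension as a fallback:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// isAudio reports whether an attachment looks like audio, based on its
// content type and, failing that, its filename extension.
func isAudio(contentType, filename string) bool {
	if strings.HasPrefix(contentType, "audio/") {
		return true
	}
	switch strings.ToLower(filepath.Ext(filename)) {
	case ".ogg", ".mp3", ".wav", ".m4a":
		return true
	}
	return false
}

func main() {
	fmt.Println(isAudio("audio/ogg", "voice-message.ogg")) // true
	fmt.Println(isAudio("image/png", "photo.png"))         // false
}
```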
From here the rest of the agent loop runs unchanged. The voice flow doesn't care that the input came from audio. It's just text now.
One small touch: when the agent replies to a voice message, it replies as a Discord quote-reply to the original audio message. That way the conversation stays visually linked.
What it costs
Zero. No API key, no per-minute charge, no monthly subscription.
The only cost is the 75MB model on disk and a few seconds of CPU per voice message. On a $5/month VPS this is nothing.
What I'd improve
The tiny model is fast but not the most accurate, especially on accents or noisy audio. Swapping in base or small is a one-line change in the setup, just a different model file. I'll probably make it configurable.

The other thing on my list is better multilingual support. The tiny model technically handles many languages, but quality drops fast outside English. I switch between English, French, and Portuguese throughout the day, and tiny struggles with that.
The bigger Whisper models (small, medium) handle that mix much better. The tradeoff is size and CPU time. small is around 460MB and a few times slower than tiny, but still fine on a normal machine. The right answer is probably to let people pick the model during setup based on the languages they actually speak.

Try it
Nevinho is open source. If you want a personal AI bot in your DMs that can listen as well as read, it takes about a minute to install:
```shell
curl -sSL https://raw.githubusercontent.com/lucasnevespereira/nevinho/main/install.sh | bash
nevinho setup
nevinho start
```
Repo here: github.com/lucasnevespereira/nevinho.
Hope this gives you ideas for your own projects. If you build something similar or have feedback, feel free to open an issue or just say hi.