Mar 5, 2025

Build a Local Voice Assistant with Home Assistant (No Cloud Required)

Alexa and Google Home work fine — until they don’t. Internet goes down, servers lag, or you realize a company is recording everything you say. Here’s how to build a voice assistant that runs entirely on your local network.

The Architecture

A local voice assistant has four pieces:

Wake word detection — listens for a trigger phrase (“Hey Jarvis”, “OK Home”, etc.)
Speech-to-text (STT) — converts your voice to text
Intent processing — figures out what you want and executes it
Text-to-speech (TTS) — speaks the response back

Home Assistant’s voice pipeline connects all of these. You can mix local and cloud components, but we’re going fully local.

Hardware Options

Satellite Device (the microphone/speaker in each room)

ESP32-S3-BOX-3 (~$45) — The best option right now. Built-in microphone, speaker, and display. Runs ESPHome with the voice_assistant component. Just flash and go.

Alternative: Any ESP32-S3 board + an I2S microphone (INMP441) and I2S speaker (MAX98357A). Cheaper but requires wiring.
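
For the DIY route, the microphone and the amplifier are each declared as I2S devices in ESPHome. Here is a minimal sketch, assuming a generic ESP32-S3 dev board. Every GPIO number is a placeholder picked for illustration, so match them to your own wiring. The ids box_mic and box_speaker are chosen to line up with the voice_assistant block in step 3.

```yaml
# Two I2S buses: one for the INMP441 mic, one for the MAX98357A amp.
# All GPIO numbers below are assumptions; use the pins you actually wired.
i2s_audio:
  - id: i2s_in
    i2s_lrclk_pin: GPIO3   # INMP441 WS
    i2s_bclk_pin: GPIO2    # INMP441 SCK
  - id: i2s_out
    i2s_lrclk_pin: GPIO6   # MAX98357A LRC
    i2s_bclk_pin: GPIO7    # MAX98357A BCLK

microphone:
  - platform: i2s_audio
    id: box_mic
    i2s_audio_id: i2s_in
    adc_type: external
    i2s_din_pin: GPIO4     # INMP441 SD
    pdm: false             # INMP441 is a standard I2S mic, not PDM

speaker:
  - platform: i2s_audio
    id: box_speaker
    i2s_audio_id: i2s_out
    dac_type: external
    i2s_dout_pin: GPIO8    # MAX98357A DIN
```

Option names can shift between ESPHome releases, so check the I2S audio docs for your version if validation complains.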

Processing Server

You need a machine to run the STT and TTS engines. Options:

Home Assistant’s own hardware — works for Whisper (STT) but can be slow on a Pi
A separate server — any Linux box with decent CPU. A machine with a GPU makes Whisper much faster
Your existing homelab server — if you’re already running Docker, just add containers
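
If you go the Docker route, the three engines can run as Wyoming protocol containers next to whatever else is on the box. This is a rough sketch rather than a tuned setup: the image names and ports are the Rhasspy project's published defaults as I understand them, and the model, voice, wake word, and volume paths are example choices.

```yaml
# docker-compose.yml: STT, TTS, and wake word as Wyoming services.
# Model, voice, and host paths are illustrative; adjust to taste.
services:
  whisper:
    image: rhasspy/wyoming-whisper
    command: --model tiny-int8 --language en
    volumes:
      - ./whisper-data:/data      # downloaded models are cached here
    ports:
      - "10300:10300"
    restart: unless-stopped

  piper:
    image: rhasspy/wyoming-piper
    command: --voice en_US-lessac-medium
    volumes:
      - ./piper-data:/data        # downloaded voices are cached here
    ports:
      - "10200:10200"
    restart: unless-stopped

  openwakeword:
    image: rhasspy/wyoming-openwakeword
    command: --preload-model 'ok_nabu'
    ports:
      - "10400:10400"
    restart: unless-stopped
```

With the containers up, add each one in Home Assistant under Settings → Devices & services as a Wyoming Protocol integration, using the server's IP address and the matching port.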

Software Stack

Component	What I Use	Why
Wake word	openWakeWord (local)	Low latency, customizable
STT	Faster Whisper	Best accuracy-to-speed ratio for local
Intent	Home Assistant Conversation	Built-in, handles HA commands natively
TTS	Piper	Fast, natural-sounding, 100% local

Step-by-Step Setup

1. Install the Add-ons

In Home Assistant, go to Settings → Add-ons and install:

Whisper (the official add-on already runs on faster-whisper)
Piper
openWakeWord

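Start each one. The default settings are fine for now. Home Assistant should then discover each add-on as a Wyoming Protocol integration under Settings → Devices & services; accept each one so it shows up as an option in the next step.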

2. Configure the Voice Pipeline

Go to Settings → Voice assistants → Add assistant:

STT: Select your Whisper instance
TTS: Select your Piper instance
Wake word: Select openWakeWord
Conversation agent: Home Assistant

3. Flash the ESP32-S3-BOX-3

In ESPHome, create a new device and use the voice-assistant preset for the S3-BOX-3. The key configuration:

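voice_assistant:
  # box_mic and box_speaker are the ids of the microphone and speaker
  # components defined elsewhere in the device config
  microphone: box_mic
  speaker: box_speaker
  # Wake word heard: show the "Listening" effect
  on_wake_word_detected:
    - light.turn_on:
        id: led_ring
        effect: "Listening"
  # Speech-to-text finished: switch to "Processing" while the intent runs
  on_stt_end:
    - light.turn_on:
        id: led_ring
        effect: "Processing"
  # Text-to-speech finished: turn the indicator off
  # (led_ring and its effects must be defined in a light component)
  on_tts_end:
    - light.turn_off:
        id: led_ring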

4. Test It

Say your wake word, then try:

“Turn on the kitchen lights”
“What’s the temperature inside?”
“Lock the front door”

Performance Tuning

Whisper model size matters. Start with tiny or base for fast responses, and move to small if accuracy isn’t good enough. The medium and large models are only worth it if you have a GPU.

Model	Transcription time (CPU)	Accuracy
tiny	~1-2s	Good for simple commands
base	~2-4s	Better for natural speech
small	~4-8s	Handles accents well

Piper voice selection: Pick a voice trained on your language. English voices like en_US-lessac-medium sound natural. The suffix is the quality tier (low, medium, high), and higher tiers add slightly more latency.

What Works Well

Direct commands (“turn off the bedroom lights”) — near-perfect accuracy
Queries about entity states (“is the garage door open?”) — works great
Simple routines (“goodnight” triggers your scene) — very reliable
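
For the “goodnight” case, one way to wire a spoken phrase to a scene is an automation with a sentence trigger. This is a sketch under assumptions: scene.goodnight is a made-up entity id, and the phrases are just examples.

```yaml
# configuration.yaml (drop the top-level key if you keep automations
# in automations.yaml)
automation:
  - alias: "Voice: goodnight"
    trigger:
      # Fires when the conversation agent hears one of these sentences
      - platform: conversation
        command:
          - "goodnight"
          - "good night"
    action:
      - service: scene.turn_on
        target:
          entity_id: scene.goodnight   # hypothetical scene; use your own
```

Because the match happens inside Home Assistant’s conversation agent, the same phrase also works when typed into the Assist chat box, which is handy for testing without a microphone.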

What’s Still Rough

Conversational follow-ups — it doesn’t maintain context between commands
Music/media control — “play jazz” requires extra integration work
Background noise — kitchens and living rooms with TV audio cause false wake words
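
If false triggers are a problem, the openWakeWord add-on exposes two sensitivity options on its Configuration tab. The values below are starting points to experiment with, not recommendations, and the defaults noted in the comments are from the add-on documentation as I remember it.

```yaml
# openWakeWord add-on → Configuration (YAML view)
threshold: 0.7      # 0 to 1; higher means fewer activations (default 0.5)
trigger_level: 2    # activations needed before a detection is reported (default 1)
```

Raising either value trades missed wake words for fewer false ones, so change one at a time and test in the noisy room itself.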

Cost Comparison

Setup	Cost	Privacy	Latency
Amazon Echo	$50	Cloud-dependent	Fast
Google Nest	$50	Cloud-dependent	Fast
Local (ESP32-S3-BOX-3 + server)	$45 + existing hardware	100% local	2-5 seconds

The latency gap is real: cloud assistants respond in under a second, while a local setup usually takes 2-5 seconds depending on your hardware. But your voice data never leaves your house.

Is It Worth It?

If privacy matters to you and you already have a homelab, absolutely. The ESP32-S3-BOX-3 is a solid piece of hardware, and the open-source speech models improve every few months.

If you just want lights to turn on when you talk, an Echo Dot is simpler. No shame in that.

The real win is building something you fully control — no subscriptions, no cloud dependency, no “sorry, I’m having trouble connecting right now.”

For more local-only smart home ideas, check out my list of 15 devices that work without internet. And if you’re setting up your network to support all of this, my VLAN guide covers how to properly segment your IoT traffic.