Build a Local Voice Assistant with Home Assistant (No Cloud Required)


Alexa and Google Home work fine until they don’t. The internet goes down, the servers lag, or you realize your voice recordings are being sent to a company’s servers. Here’s how to build a voice assistant that runs entirely on your local network.

The Architecture

A local voice assistant has four pieces:

  1. Wake word detection — listens for a trigger phrase (“Hey Jarvis”, “OK Home”, etc.)
  2. Speech-to-text (STT) — converts your voice to text
  3. Intent processing — figures out what you want and executes it
  4. Text-to-speech (TTS) — speaks the response back

Home Assistant’s voice pipeline connects all of these. You can mix local and cloud components, but we’re going fully local.

Hardware Options

Satellite Device (the microphone/speaker in each room)

ESP32-S3-BOX-3 (~$45) — The best option right now. Built-in microphone, speaker, and display. Runs ESPHome with the voice_assistant component. Just flash and go.

Alternative: Any ESP32-S3 board + an I2S microphone (INMP441) and I2S speaker (MAX98357A). Cheaper but requires wiring.
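If you go the hand-wired route, the ESPHome side might look roughly like the sketch below. Treat it as a starting point, not a tested config: every GPIO number is a placeholder for whatever pins you actually wire, the bus IDs (i2s_in, i2s_out) are names I made up, and a few option names (channel, bits_per_sample) have changed between ESPHome releases, so check them against the version you run. I reused the box_mic and box_speaker IDs so the voice_assistant block in step 3 works unchanged.

```yaml
# Sketch only: all pin numbers are placeholders, not a verified pinout.
i2s_audio:
  - id: i2s_in                # hypothetical ID: bus for the INMP441 microphone
    i2s_lrclk_pin: GPIO3      # INMP441 WS
    i2s_bclk_pin: GPIO2       # INMP441 SCK
  - id: i2s_out               # hypothetical ID: bus for the MAX98357A amplifier
    i2s_lrclk_pin: GPIO6      # MAX98357A LRC
    i2s_bclk_pin: GPIO7       # MAX98357A BCLK

microphone:
  - platform: i2s_audio
    id: box_mic               # same ID the voice_assistant block refers to
    i2s_audio_id: i2s_in
    adc_type: external
    i2s_din_pin: GPIO4        # INMP441 SD
    pdm: false                # the INMP441 is a standard I2S mic, not PDM
    channel: left             # assumes the INMP441 L/R pin is tied to GND
    bits_per_sample: 32bit

speaker:
  - platform: i2s_audio
    id: box_speaker           # same ID the voice_assistant block refers to
    i2s_audio_id: i2s_out
    dac_type: external
    i2s_dout_pin: GPIO8       # MAX98357A DIN
```

I put the microphone and amplifier on separate I2S buses so they don't have to share clock pins. That costs two extra GPIOs but keeps the wiring easy to debug.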

Processing Server

You need a machine to run the STT and TTS engines. Options:

  • Home Assistant’s own hardware — works for Whisper (STT) but can be slow on a Pi
  • A separate server — any Linux box with decent CPU. A machine with a GPU makes Whisper much faster
  • Your existing homelab server — if you’re already running Docker, just add containers
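For the Docker route, a Compose file along these lines is a reasonable starting point. The image names, ports, and command-line flags are what I understand the Rhasspy Wyoming images to use, so treat them as assumptions and verify them against the images' own documentation. The local volume paths are arbitrary.

```yaml
# docker-compose.yml sketch: image names, ports, and flags are assumptions.
services:
  whisper:
    image: rhasspy/wyoming-whisper
    command: --model tiny-int8 --language en
    volumes:
      - ./whisper-data:/data        # downloaded models are cached here
    ports:
      - "10300:10300"
    restart: unless-stopped

  piper:
    image: rhasspy/wyoming-piper
    command: --voice en_US-lessac-medium
    volumes:
      - ./piper-data:/data          # downloaded voices are cached here
    ports:
      - "10200:10200"
    restart: unless-stopped

  openwakeword:
    image: rhasspy/wyoming-openwakeword
    command: --preload-model 'ok_nabu'
    ports:
      - "10400:10400"
    restart: unless-stopped
```

Once the containers are up, you would add each one to Home Assistant through the Wyoming Protocol integration, using the server's IP address and the matching port.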

Software Stack

| Component | What I Use | Why |
|---|---|---|
| Wake word | openWakeWord (local) | Low latency, customizable |
| STT | Faster Whisper | Best accuracy-to-speed ratio for local |
| Intent | Home Assistant Conversation | Built-in, handles HA commands natively |
| TTS | Piper | Fast, natural-sounding, 100% local |

Step-by-Step Setup

1. Install the Add-ons

In Home Assistant, go to Settings → Add-ons and install:

  • Whisper (the official add-on already runs faster-whisper under the hood)
  • Piper
  • openWakeWord

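Start each one. The default settings are fine to begin with. Home Assistant should then discover each add-on as a Wyoming Protocol integration under Settings → Devices & services; accept each one so it shows up in the pipeline options in the next step.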

2. Configure the Voice Pipeline

Go to Settings → Voice assistants → Add assistant:

  • STT: Select your Whisper instance
  • TTS: Select your Piper instance
  • Wake word: Select openWakeWord
  • Conversation agent: Home Assistant

3. Flash the ESP32-S3-BOX-3

In ESPHome, create a new device and use the voice-assistant preset for the S3-BOX-3. The key configuration:

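voice_assistant:
  # IDs of the microphone and speaker components defined elsewhere in the config
  microphone: box_mic
  speaker: box_speaker
  # Wake word heard: switch the light to its "Listening" effect
  on_wake_word_detected:
    - light.turn_on:
        id: led_ring
        effect: "Listening"
  # Speech has been transcribed: show "Processing" while the intent runs
  on_stt_end:
    - light.turn_on:
        id: led_ring
        effect: "Processing"
  # Spoken response finished: turn the light off again
  on_tts_end:
    - light.turn_off:
        id: led_ring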

4. Test It

Say your wake word, then try:

  • “Turn on the kitchen lights”
  • “What’s the temperature inside?”
  • “Lock the front door”

Performance Tuning

Whisper model size matters. Start with tiny or base for fast responses. Move to small if accuracy isn’t good enough. medium and large are only worth it if you have a GPU.

| Model | Speed (CPU) | Accuracy |
|---|---|---|
| tiny | ~1-2 s | Good for simple commands |
| base | ~2-4 s | Better for natural speech |
| small | ~4-8 s | Handles accents well |
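If you run the Whisper add-on, the model is set on its Configuration tab. In YAML view it looks roughly like this. The option names are how I remember the add-on's settings, so treat them as assumptions and check the add-on's Documentation tab for the exact model names it accepts.

```yaml
# Whisper add-on configuration sketch; option names are assumptions.
model: base        # start with tiny or base, move to small if accuracy suffers
language: en       # fixing the language avoids auto-detection guesswork
beam_size: 1       # larger values trade speed for accuracy
```

Restart the add-on after changing the model. The first start with a new model can take a while because the model has to be downloaded.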

Piper voice selection: Pick a voice trained on your language. English voices like en_US-lessac-medium sound natural. Higher-quality voices add slightly more latency.
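The voice is likewise a setting on the Piper add-on's Configuration tab. A minimal sketch, assuming the option names below match your add-on version:

```yaml
# Piper add-on configuration sketch; option names are assumptions.
voice: en_US-lessac-medium   # the voice named above
length_scale: 1.0            # speaking rate; values above 1.0 slow speech down
```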

What Works Well

  • Direct commands (“turn off the bedroom lights”) — near-perfect accuracy
  • Queries about entity states (“is the garage door open?”) — works great
  • Simple routines (“goodnight” triggers your scene) — very reliable
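A “goodnight” routine like the one above can be wired up with a custom sentence plus an intent script. This is a sketch, not my exact config: the intent name GoodNight and the entity scene.goodnight are hypothetical, so substitute your own. The first file goes in config/custom_sentences/en/ (any filename ending in .yaml):

```yaml
# config/custom_sentences/en/goodnight.yaml
language: "en"
intents:
  GoodNight:                 # hypothetical intent name
    data:
      - sentences:
          - "goodnight"
          - "good night"
```

The matching handler goes in configuration.yaml:

```yaml
# configuration.yaml
intent_script:
  GoodNight:                 # must match the intent name above
    action:
      - service: scene.turn_on
        target:
          entity_id: scene.goodnight   # hypothetical scene entity
    speech:
      text: "Good night."
```

Restart Home Assistant after adding both pieces so the new sentences and the intent script are loaded.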

What’s Still Rough

  • Conversational follow-ups — it doesn’t maintain context between commands
  • Music/media control — “play jazz” requires extra integration work
  • Background noise — kitchens and living rooms with TV audio cause false wake words
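For the false wake words, the first thing I would try is making openWakeWord stricter. As far as I know the add-on exposes the two options below on its Configuration tab. Both the option names and the values are assumptions to tune against your own room, not recommended settings.

```yaml
# openWakeWord add-on configuration sketch; names and values are assumptions.
threshold: 0.6      # 0-1; higher means fewer activations
trigger_level: 2    # activations required before a detection counts
```

Raising either value cuts false triggers but also makes the satellite harder to wake from across the room, so change one at a time.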

Cost Comparison

| Setup | Cost | Privacy | Latency |
|---|---|---|---|
| Amazon Echo | $50 | Cloud-dependent | Fast |
| Google Nest | $50 | Cloud-dependent | Fast |
| Local (ESP32-S3-BOX-3 + server) | $45 + existing hardware | 100% local | 2-5 seconds |

The latency gap is real: cloud assistants respond in under a second, while a local setup usually takes 2-5 seconds depending on your hardware. But your voice data never leaves your house.

Is It Worth It?

If privacy matters to you and you already have a homelab, absolutely. The ESP32-S3-BOX-3 is a solid piece of hardware, and the open-source speech models improve every few months.

If you just want lights to turn on when you talk, an Echo Dot is simpler. No shame in that.

The real win is building something you fully control — no subscriptions, no cloud dependency, no “sorry, I’m having trouble connecting right now.”

For more local-only smart home ideas, check out my list of 15 devices that work without internet. And if you’re setting up your network to support all of this, my VLAN guide covers how to properly segment your IoT traffic.