
Whisper Transcription

Identity

Service: Whisper Transcription
Container Type: lxc
VMID: 105
IP:Port: 10.1.10.105

Host

Host: Jack

Network

VLAN: VLAN 10 — Production

Resources

vCPU: 1
RAM: 512 MB
Disk: 2 GB
OS: Debian 13 (trixie)
Domain

Depends On: None
Depended On By: None

Overview

Whisper Transcription is an automatic video transcription service running on a lightweight Debian 13 LXC container. It uses the OpenAI Whisper API to convert video files into SRT subtitle files without any local GPU requirements.

How It Works

The workflow is simple: drop a video file into the SMB share, and the service handles the rest.

  1. File detection — A systemd polling service scans the incoming directory every 30 seconds for new video files (.mp4, .mkv, .webm, .mov)
  2. Stability check — Before processing, the service verifies the file has finished copying by comparing file sizes 10 seconds apart. This prevents processing partially transferred files over SMB
  3. Audio extraction — ffmpeg extracts audio as a mono 16kHz MP3 at 48kbps, keeping file sizes small for the API
  4. Chunking — If the audio exceeds the 25MB API limit, it's automatically split into 20-minute chunks
  5. Transcription — Each chunk is sent to the OpenAI Whisper API (whisper-1 model) which returns SRT-formatted subtitles
  6. Merging — For chunked files, the SRT segments are merged with corrected timestamps to produce a single continuous transcript
  7. Output routing — The original video, extracted audio, and SRT file are moved to a completed directory organized by filename
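The stability check in step 2 amounts to sampling the file size twice and comparing. A minimal sketch (the function name and the `wait_seconds` parameter are illustrative; the service uses a 10-second interval):

```python
import os
import time

def is_stable(path: str, wait_seconds: float = 10.0) -> bool:
    """Return True if the file's size is unchanged after wait_seconds,
    i.e. the SMB copy has (probably) finished."""
    size_before = os.path.getsize(path)
    time.sleep(wait_seconds)
    return os.path.getsize(path) == size_before
```

A still-growing file fails the check and is simply picked up again on the next 30-second poll, so no transfer is ever lost, only delayed.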
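The audio extraction in step 3 maps directly onto standard ffmpeg flags (`-vn` drops video, `-ac 1` downmixes to mono, `-ar 16000` resamples to 16 kHz, `-b:a 48k` sets the bitrate). A sketch, with assumed function names:

```python
import subprocess

def extract_audio_cmd(video_path: str, audio_path: str) -> list[str]:
    """Build the ffmpeg command for step 3: drop the video stream,
    downmix to mono, resample to 16 kHz, encode MP3 at 48 kbps."""
    return [
        "ffmpeg", "-y",
        "-i", video_path,
        "-vn",           # no video stream
        "-ac", "1",      # mono
        "-ar", "16000",  # 16 kHz sample rate
        "-b:a", "48k",   # 48 kbps bitrate
        audio_path,
    ]

def extract_audio(video_path: str, audio_path: str) -> None:
    subprocess.run(extract_audio_cmd(video_path, audio_path), check=True)
```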
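Steps 4 and 5 could look like the following, assuming the OpenAI Python SDK (v1+) with `OPENAI_API_KEY` set in the environment; the constants mirror the 25 MB limit and 20-minute chunk length described above:

```python
from pathlib import Path

API_LIMIT_BYTES = 25 * 1024 * 1024  # 25 MB API upload limit
CHUNK_SECONDS = 20 * 60             # 20-minute chunks

def needs_chunking(audio_path: str) -> bool:
    """Step 4: decide whether the extracted audio must be split."""
    return Path(audio_path).stat().st_size > API_LIMIT_BYTES

def transcribe_srt(audio_path: str) -> str:
    """Step 5: send one audio file/chunk to the Whisper API,
    requesting SRT-formatted output."""
    from openai import OpenAI  # imported lazily; requires openai>=1.0
    client = OpenAI()
    with open(audio_path, "rb") as f:
        return client.audio.transcriptions.create(
            model="whisper-1",
            file=f,
            response_format="srt",
        )
```

At 48 kbps, a 20-minute chunk is about 48,000 × 1,200 / 8 ≈ 7.2 MB, comfortably under the 25 MB limit.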
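The merge in step 6 shifts each chunk's timestamps by its position in the original audio and renumbers the cues. A simplified sketch (it assumes every chunk is exactly `chunk_seconds` long; a real implementation would offset by each chunk's actual duration):

```python
import re

TS = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")

def shift_ts(match: re.Match, offset_ms: int) -> str:
    """Shift one HH:MM:SS,mmm timestamp forward by offset_ms."""
    h, m, s, ms = (int(g) for g in match.groups())
    total = ((h * 60 + m) * 60 + s) * 1000 + ms + offset_ms
    h, rem = divmod(total, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def merge_srt(chunks: list[str], chunk_seconds: int = 1200) -> str:
    """Merge per-chunk SRT texts into one continuous transcript:
    shift timestamps per chunk, renumber cues sequentially."""
    out, cue_no = [], 1
    for i, chunk in enumerate(chunks):
        offset_ms = i * chunk_seconds * 1000
        for block in chunk.strip().split("\n\n"):
            lines = block.splitlines()
            if len(lines) < 2:
                continue
            # lines[0] is the cue number, lines[1] the timing line
            timing = TS.sub(lambda m: shift_ts(m, offset_ms), lines[1])
            out.append("\n".join([str(cue_no), timing] + lines[2:]))
            cue_no += 1
    return "\n\n".join(out) + "\n"
```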

Architecture

The service runs as an unprivileged LXC container with bind mounts to the same ZFS datasets used by the Samba file server. This means files dropped via SMB are immediately visible to the transcription service with zero network overhead — both containers access the underlying ZFS storage directly.
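On the Proxmox host, that arrangement boils down to a bind mount shared with the Samba container; the dataset and mount paths below are illustrative, not the actual ones:

```shell
# On the Proxmox host (Jack): bind-mount the same ZFS dataset into
# this container (VMID 105) that the Samba container already exports.
# Paths are examples only.
pct set 105 -mp0 /tank/transcribe,mp=/srv/transcribe
```

Because the container is unprivileged, host UIDs are shifted, so the dataset's ownership has to line up with the mapped IDs (or an idmap entry has to be configured) for both containers to read and write the same files.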

A separate AI-powered correction step (run on demand) reviews the raw transcripts to fix common Whisper mistakes like misheard technical terms and hallucinated content before the transcripts are published.
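The on-demand correction pass could be a single chat-model call over the raw SRT; everything below (function names, prompt wording, the `gpt-4o-mini` model choice) is an illustrative assumption, not the actual implementation:

```python
def build_correction_prompt(srt_text: str) -> list[dict]:
    """Messages for a proofreading pass over a raw Whisper transcript."""
    system = (
        "You are a subtitle proofreader. Fix misheard technical terms and "
        "remove hallucinated lines. Keep SRT numbering and timestamps unchanged."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": srt_text},
    ]

def correct_transcript(srt_text: str, model: str = "gpt-4o-mini") -> str:
    """Run the correction pass; requires OPENAI_API_KEY."""
    from openai import OpenAI  # imported lazily; requires openai>=1.0
    client = OpenAI()
    resp = client.chat.completions.create(
        model=model,
        messages=build_correction_prompt(srt_text),
    )
    return resp.choices[0].message.content
```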

Why the OpenAI API

A previous setup used whisper.cpp with local GPU acceleration, which required building from source for a specific GPU architecture, maintaining the ROCm stack, and keeping the inference host online. The API approach trades a small per-file cost (~$0.006/minute) for zero maintenance and consistent results. At that rate, a two-hour video costs roughly $0.72 to transcribe.