Building a voice assistant that lets workshop staff manage appointments, customers, and vehicle history through natural conversation. From a self-hosted pipeline with 3-second delays to production-grade real-time voice with LiveKit.
Client: INofficina.it
Auto repair workshops handle a constant stream of service requests: tyre changes, engine maintenance, bodywork, vehicle inspections. Staff juggle booking calendars, customer databases, and vehicle records, switching between screens while a customer is on the phone or standing at the counter.
My goal was to let workshop staff say something like "Book a service for the grey SUV, plate AB 123 CD, first available slot next Wednesday, and send a WhatsApp confirmation with a reminder the day before" and have it all happen in one go. No typing, no screen switching, no manual data entry across three different systems.
But there are also more complex scenarios. For example: "Email all customers who had a tyre change more than 6 months ago and remind them it's time to come back." This kind of request touches multiple services, requires heavy queries across the CRM, and needs tool implementations that are both performant and safe.
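As an illustration of what a "performant and safe" tool for that request might look like, here is a minimal sketch using an in-memory SQLite database. The schema (`customers`, `services`), the `MAX_RECIPIENTS` cap, and the function name are all hypothetical and not taken from the actual CRM; the point is the shape: one parameterized query, a hard limit on recipients, and the filter on the customer's most recent tyre change rather than any past one.

```python
import sqlite3
from datetime import date, timedelta

MAX_RECIPIENTS = 500  # hypothetical safety cap for bulk outreach


def customers_due_for_tyre_check(conn, today, months=6, limit=MAX_RECIPIENTS):
    """Return (name, email) rows whose latest tyre change is older than the cutoff."""
    # Approximate a month as 30 days; ISO dates compare correctly as strings.
    cutoff = (today - timedelta(days=30 * months)).isoformat()
    return conn.execute(
        """
        SELECT c.name, c.email
        FROM customers AS c
        JOIN services AS s ON s.customer_id = c.id
        WHERE s.kind = 'tyre_change'
        GROUP BY c.id
        HAVING MAX(s.done_on) < ?
        ORDER BY c.name
        LIMIT ?
        """,
        (cutoff, limit),  # parameters are bound, never interpolated into the SQL
    ).fetchall()


def demo_db():
    """Build a tiny in-memory database with made-up sample data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT);
        CREATE TABLE services (customer_id INTEGER, kind TEXT, done_on TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?, ?)",
        [
            (1, "Anna", "anna@example.com"),
            (2, "Bruno", "bruno@example.com"),
            (3, "Carla", "carla@example.com"),
        ],
    )
    conn.executemany(
        "INSERT INTO services VALUES (?, ?, ?)",
        [
            (1, "tyre_change", "2024-01-10"),  # old change: due for a reminder
            (2, "tyre_change", "2024-01-15"),  # old change...
            (2, "tyre_change", "2024-11-20"),  # ...but came back recently: not due
            (3, "inspection", "2023-05-01"),   # never had a tyre change: not due
        ],
    )
    return conn
```

Returning the recipient list instead of sending emails directly keeps the destructive step separate, so the assistant can read back "one customer matches, shall I send?" before anything leaves the system.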
Building a voice assistant for these scenarios is both a technical and a design problem. Two challenges shaped the entire project: latency (a 3-second delay turns an assistant into an obstacle) and observability (I needed full visibility into every stage of the pipeline: what was transcribed, how the LLM interpreted it, and how the response was converted back to speech).
The first choice was the voice architecture. Conversational (end-to-end) models like the OpenAI Realtime API and Gemini Live handle everything in one shot. They offer lower latency, but the model is a black box: you can't see what was transcribed, how the LLM reasoned, or how the TTS rendered the response. You can't swap components independently, and debugging is guesswork. The alternative is a cascade pipeline (Speech-to-Text → LLM → Text-to-Speech) where each stage is observable and independently tunable. I chose the cascade approach because I needed that observability: when a license plate is misheard, I need to know whether it was a transcription error or an LLM interpretation issue. When a response sounds wrong, I need to isolate whether the problem is in the text generation or in the speech synthesis.
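To make the observability argument concrete, here is a minimal sketch of a cascade in which every stage records its input, output, and duration. The `Trace` and `StageRecord` classes and the three `fake_*` stages are hypothetical stand-ins, not the project's code; in a real pipeline they would wrap the actual STT, LLM, and TTS calls.

```python
import time
from dataclasses import dataclass, field


@dataclass
class StageRecord:
    stage: str
    input: object
    output: object
    seconds: float


@dataclass
class Trace:
    records: list = field(default_factory=list)

    def run(self, stage, fn, value):
        """Call one pipeline stage and record what went in and what came out."""
        start = time.perf_counter()
        result = fn(value)
        elapsed = time.perf_counter() - start
        self.records.append(StageRecord(stage, value, result, elapsed))
        return result


def run_cascade(audio, stt, llm, tts, trace):
    """Speech-to-Text -> LLM -> Text-to-Speech, with each hop traced."""
    text = trace.run("stt", stt, audio)
    reply = trace.run("llm", llm, text)
    return trace.run("tts", tts, reply)


# Stub stages for illustration only; real ones would call external services.
def fake_stt(audio):
    return "book a service for plate AB 123 CD"


def fake_llm(text):
    return "Booked a service for plate AB 123 CD."


def fake_tts(reply):
    return reply.encode("utf-8")
```

With a trace like this, a misheard plate shows up in the `stt` record's output, while a correct transcript followed by a wrong booking points at the `llm` record instead.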
The initial cascade implementation ran on my own servers: OpenAI Whisper for transcription, GPT-4o with function calling for intent processing, and OpenAI TTS for speech synthesis. This architecture worked for prototyping and validated the cascade design, but every step ran in sequence and each one added latency: audio upload, transcription, LLM reasoning, TTS generation, audio download. By the time the assistant replied, 3 to 5 seconds had passed. The pipeline was right; the execution needed to be faster.
I found LiveKit, an open-source platform for real-time audio and video communication. LiveKit lets you run the same cascade pipeline but on infrastructure designed for real-time performance: WebRTC for low-latency audio streaming, a Python framework (LiveKit Agents) for building voice AI pipelines server-side, built-in Voice Activity Detection, and streaming TTS that starts playing before the full sentence is generated. Audio streams via WebRTC with no file uploads and no polling. The LLM starts generating while the user is still finishing their sentence, and TTS plays while the LLM is still generating text. I kept full observability into every stage while getting sub-second response times.
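The overlap between text generation and speech can be sketched without any LiveKit code. The example below is a simplified, pull-based illustration using plain `asyncio`, not the LiveKit Agents API: tokens are grouped into sentences as they arrive, and each complete sentence is handed to a stand-in synthesizer immediately, so the first sentence is spoken before the last token exists. The function names and the event log are assumptions made for the demonstration.

```python
import asyncio
import re


async def sentences(token_stream):
    """Group streamed tokens into sentences, yielding each as soon as it is complete."""
    buffer = ""
    async for token in token_stream:
        buffer += token
        # A sentence ends at ., ! or ? followed by whitespace.
        while (match := re.search(r"[.!?]\s", buffer)):
            yield buffer[: match.end()].strip()
            buffer = buffer[match.end():]
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains when the stream ends


async def respond(tokens):
    """Run a fake LLM stream through sentence chunking into a fake TTS, logging order."""
    events = []

    async def token_stream():
        for token in tokens:
            events.append(("llm", token))
            yield token
        events.append(("llm_done", None))

    async def synthesize(sentence):
        events.append(("tts", sentence))
        await asyncio.sleep(0)  # stand-in for streaming audio out

    async for sentence in sentences(token_stream()):
        await synthesize(sentence)
    return events
```

Running `asyncio.run(respond(["Your ", "slot ", "is ", "booked. ", "A ", "reminder ", "is ", "set."]))` logs the `tts` event for the first sentence ahead of `llm_done`. A production pipeline runs the stages concurrently rather than pulling one token at a time, but the ordering effect is the same.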
INofficina.it runs on WordPress with the Bookly appointment plugin (PHP), but LiveKit Agents is Python-native. I needed to bridge the two. The LiveKit Agent (Python) handles the voice pipeline. The WordPress REST API (PHP) exposes appointment tools: create, search, and modify bookings, look up customers and vehicles, check availability. The AI Commander plugin connects them, registering tools in WordPress and exposing them as REST endpoints that the Python agent calls whenever the LLM issues a function call. When the LLM needs to book an appointment, it calls a function; the Python agent forwards it to the WordPress REST API; the result flows back and gets spoken to the user, all in under a second.
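The forwarding step on the Python side can be sketched with the standard library alone. Everything specific here is an assumption for illustration: the route prefix under `/wp-json/`, the tool names in `ALLOWED_TOOLS`, and the bearer token, which stands in for whatever authentication the site actually uses. The `send` parameter exists so the HTTP call can be replaced in tests.

```python
import json
import urllib.request

# Hypothetical route prefix; the real plugin's paths may differ.
BASE_URL = "https://example.com/wp-json/ai-commander/v1"

# Hypothetical allowlist: only tools registered on the WordPress side are forwarded.
ALLOWED_TOOLS = {"create_appointment", "search_appointments", "check_availability"}


def build_tool_request(tool_name, arguments, token):
    """Turn an LLM function call into a POST request for the matching REST endpoint."""
    if tool_name not in ALLOWED_TOOLS:
        raise ValueError(f"unknown tool: {tool_name}")
    return urllib.request.Request(
        url=f"{BASE_URL}/tools/{tool_name}",
        data=json.dumps(arguments).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # placeholder auth scheme
        },
    )


def call_tool(tool_name, arguments, token, send=None):
    """Forward the call and return the decoded JSON result for the LLM to verbalize."""
    request = build_tool_request(tool_name, arguments, token)
    if send is None:
        def send(req):
            # A short timeout keeps a slow backend from stalling the conversation.
            with urllib.request.urlopen(req, timeout=5) as response:
                return response.read()
    return json.loads(send(request))
```

Checking the tool name against an allowlist before building the URL means a hallucinated or malformed function name fails fast in Python instead of reaching WordPress.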