Mobile & AR

Voice & Text Operations

Run the business from anywhere. Same tools, same context, voice or text, whichever the situation calls for.

Outcome

Mobile interfaces where business owners and field staff manage real operations — not just queries, but mutations to live business state — by voice while driving and by text in meetings, with the same tool registry behind both.

Modalities: Voice · Text
Latency target: Real-time bidi
Refusal pattern: On uncertainty, always
Technologies: Gemini Live (Bidi WebSocket) · Tool-calling schemas · Domain-grounded system prompts · Refusal-on-uncertainty patterns · SwiftUI · iMessage-style chat UI · GraphQL
Problem

The tools that run trade and field businesses were designed for desktops. Their users live on the road. The current generation of AI models finally made hands-free operation possible — but only if the model can be trusted to mutate real business state, and most voice integrations can't be.

How it's built
  • Design tool schemas with strict types so the model can't smuggle ambiguity into business mutations (a schema sketch follows this list)
  • Ground the model in the business's vocabulary, units, and constraints in the system prompt
  • Require refusal-on-uncertainty: when a required field is ambiguous, ask, don't guess
  • Share one tool registry between the voice and text surfaces; preserve conversational state across modality switches
  • Stream audio over Gemini Live's bidirectional WebSocket; treat the protocol as solved and spend the budget on judgment
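
To make the first point concrete, here is a minimal sketch of a strict tool schema, written as Swift types that mirror the JSON-Schema-style function declarations Gemini accepts. The tool name, fields, and bounds are hypothetical, not the production registry.

```swift
import Foundation

// Hypothetical declaration types mirroring the JSON-Schema-style tool
// definitions sent over the Live WebSocket. Every business field is a
// constrained type: enums instead of free text, bounded integers
// instead of decimals.
struct ToolDeclaration: Encodable {
    let name: String
    let description: String
    let parameters: ParameterSchema
}

struct ParameterSchema: Encodable {
    let type = "object"
    let properties: [String: Property]
    let required: [String]
}

struct Property: Encodable {
    let type: String
    let description: String
    var allowedValues: [String]? = nil
    var minimum: Int? = nil
    var maximum: Int? = nil

    enum CodingKeys: String, CodingKey {
        case type, description, minimum, maximum
        case allowedValues = "enum" // JSON Schema's keyword
    }
}

// "Add a measurement note" as a strict schema. The model can't hand
// back "inside-ish" or a decimal width; the schema has no slot for it.
let addMeasurementNote = ToolDeclaration(
    name: "add_measurement_note", // illustrative tool name
    description: "Attach a window measurement to a project.",
    parameters: ParameterSchema(
        properties: [
            "project": Property(type: "string",
                                description: "Exact project name."),
            "mount": Property(type: "string",
                              description: "Mount style.",
                              allowedValues: ["inside", "outside"]),
            "width_inches": Property(type: "integer",
                                     description: "Whole inches.",
                                     minimum: 1, maximum: 300),
            "width_eighths": Property(type: "integer",
                                      description: "Fractional eighths, 0 through 7.",
                                      minimum: 0, maximum: 7)
        ],
        required: ["project", "mount", "width_inches", "width_eighths"]
    )
)
```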

Voice AI fails in the same place every time. Teams spend their engineering budget on the audio plumbing because it's what's new, then ship a model that can't tell inside-mount from outside-mount. The audio path is one library import. The judgment is the product.

What makes voice operations work is the system prompt and the tool schema. Inches with fractional eighths, never decimals. Inside-mount versus outside-mount as a constrained enum, never a free-text guess. Refusal-on-uncertainty as the highest-leverage line in the prompt.
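
In code, those constraints are ordinary types. A sketch, with invented names, of what the arguments become once they clear the schema:

```swift
// Illustrative domain types behind the schema. Once arguments land in
// these, no decimal width and no free-text mount can survive the trip.
enum Mount: String {
    case inside
    case outside // the enum's two cases are the only mounts that exist
}

struct Width {
    let inches: Int
    let eighths: Int // 0 through 7, the only fraction the trade uses

    init?(inches: Int, eighths: Int) {
        guard inches >= 0, (0...7).contains(eighths) else { return nil }
        self.inches = inches
        self.eighths = eighths
    }

    // Renders the way the trade writes it: 34 3/8", never 34.375.
    var display: String {
        eighths == 0 ? "\(inches)\"" : "\(inches) \(eighths)/8\""
    }
}
```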

Voice and text aren't separate products. They're surfaces over the same tool registry. A workflow started by voice — "add a measurement note to project Hartman" — can be finished by text without losing context. Switching modalities mid-conversation feels like switching from typing to dictation, not from one app to another.
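
A minimal sketch of that shared-registry shape, assuming one state object handed to both sessions. ToolRegistry, ConversationState, and the handler signature are illustrative names, not the production API.

```swift
import Foundation

// One registry, two surfaces. The voice session (Gemini Live WebSocket)
// and the text session (chat UI) differ only in transport; tool
// execution and conversational state are shared.
struct ToolCall {
    let name: String
    let arguments: [String: String]
}

final class ConversationState {
    var activeProject: String?   // survives a modality switch
    var pendingQuestion: String? // an unanswered clarification does too
}

final class ToolRegistry {
    typealias Handler = (ToolCall, ConversationState) -> String
    private var handlers: [String: Handler] = [:]

    func register(_ name: String, handler: @escaping Handler) {
        handlers[name] = handler
    }

    func dispatch(_ call: ToolCall, state: ConversationState) -> String {
        guard let handler = handlers[call.name] else {
            return "Unknown tool: \(call.name)"
        }
        return handler(call, state)
    }
}

// Both surfaces hold the same registry and the same state object, so
// "add a measurement note to project Hartman" started by voice can be
// finished by text without re-establishing which project is active.
let registry = ToolRegistry()
let state = ConversationState()

registry.register("add_measurement_note") { call, state in
    state.activeProject = call.arguments["project"]
    return "Noted for \(call.arguments["project"] ?? "?")."
}
```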

What I'd tell someone about to build this
  • Strict tool schemas eliminate an entire class of bugs. Make the model commit to enums and integer ranges, and it asks instead of guessing.
  • Domain grounding in the system prompt catches more bugs than any audio-pipeline tuning ever will (a prompt sketch follows this list).
  • Don't build voice and text as separate products. They're modalities over one shared toolset.
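
As a sketch of that grounding, here is the shape of such a system prompt as a Swift string constant. The business specifics are invented for illustration; the refusal rule at the end is the pattern the section describes.

```swift
// Illustrative fragment of a domain-grounded system prompt. The last
// rule is the refusal-on-uncertainty line: ask, don't guess.
let systemPrompt = """
You operate tools for a window-treatment business.

Vocabulary and units:
- Measurements are inches with fractional eighths, e.g. 34 3/8". Never decimals.
- Mount is exactly "inside" or "outside". There is no third option.

Rules:
- Confirm every required field before calling any tool that mutates business state.
- If a required field is ambiguous or missing, do not guess. Ask one short clarifying question.
"""
```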

Want this for your product?

Let's talk about what you're trying to ship.

Book a call →