Mobile & AR

AR + On-Device ML for Mobile

Custom CoreML models, ARKit world tracking, HealthKit-grounded voice — all on-device, sub-100ms, no network required.

Outcome

Native iOS systems that combine custom on-device ML, ARKit world-anchored detection, real-time voice grounded in live biometrics, and offline-first sync — engineered for environments where battery, latency, and connectivity all matter.

<100ms on-device inference
No network required
Sensors integrated: Camera · GPS · HRM · Cadence · Altimeter
Technologies
Swift 6 · SwiftUI · CoreML · Vision · ARKit · Metal · AVFoundation · Gemini Live · HealthKit · SwiftData · CloudKit · CoreHaptics
Problem

The hardware to do AR and on-device ML has shipped on the iPhone for years. The software typically has not used it. Cycling, field work, outdoor activities, and any application where the user moves through space need responses faster than the network can deliver and capabilities the cloud can't ground.

How it's built
  • Train custom CoreML models on the actual problem domain; compile through Metal-accelerated paths for sub-100ms inference (model-loading sketch after this list)
  • Use ARKit world tracking for spatial cues that don't drift as the user moves
  • Ground voice coaching in HealthKit-derived biometrics: heart rate, cadence, altitude, weather
  • Design data layer SwiftData-first with CloudKit sync; offline as the default, network as the bonus
  • Tune the whole pipeline for battery — inference happens in idle frame time, not on a timer
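
A minimal sketch of the model-loading side of the first bullet, assuming a custom classifier compiled into the app bundle; the model name is hypothetical, and computeUnits = .all simply lets CoreML schedule work on the Neural Engine or GPU with CPU as the fallback.

```swift
import Foundation
import CoreML

// Hypothetical model name; any custom-trained classifier bundled as a compiled
// .mlmodelc works the same way.
func loadClassifier() throws -> MLModel {
    let config = MLModelConfiguration()
    config.computeUnits = .all   // prefer Neural Engine / GPU, fall back to CPU

    guard let url = Bundle.main.url(forResource: "TrailObstacleClassifier",
                                    withExtension: "mlmodelc") else {
        throw CocoaError(.fileNoSuchFile)
    }
    // The compiled model ships inside the app; nothing is fetched at runtime.
    return try MLModel(contentsOf: url, configuration: config)
}
```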

On-device CoreML means the response happens before the user notices. A custom-trained classifier runs sub-100ms in idle frame time. Battery cost is negligible. The model works in the woods, on a trail with no cell, on a long ride where the cloud isn't an option. That's the unlock — capabilities that don't depend on a round-trip.
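
One shape "inference in idle frame time" can take, sketched under the assumption that camera frames arrive through an ARSession delegate on the main queue (the default): a new frame is classified only when the previous request has finished, so dropped frames provide the backpressure rather than a timer or a work queue. The class name is illustrative.

```swift
import ARKit
import Vision

final class IdleTimeClassifier: NSObject, ARSessionDelegate {
    private let request: VNCoreMLRequest
    private let inferenceQueue = DispatchQueue(label: "inference", qos: .userInitiated)
    private var inFlight = false   // read and written on the delegate (main) queue

    init(request: VNCoreMLRequest) {
        self.request = request
    }

    func session(_ session: ARSession, didUpdate frame: ARFrame) {
        guard !inFlight else { return }   // skip this frame instead of queueing work
        inFlight = true
        let pixelBuffer = frame.capturedImage

        inferenceQueue.async { [weak self] in
            guard let self else { return }
            let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, orientation: .right)
            try? handler.perform([self.request])
            // Hand self.request.results to the UI / haptics layer here.
            DispatchQueue.main.async { self.inFlight = false }
        }
    }
}
```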

ARKit world tracking is the difference between "there's an obstacle somewhere" and "there's an obstacle two seconds ahead, on your left." Anchored detection persists in world coordinates as the camera moves; naive bounding boxes drift. The system fails gracefully when tracking quality drops, falling back to plain Vision detection without the spatial precision.
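
A sketch of the anchoring step, shown with ARSCNView for brevity; the production view layer and fallback path may differ. When tracking isn't .normal, the caller stays on un-anchored Vision bounding boxes.

```swift
import ARKit

// Promote a 2D detection (screen point) into a world anchor that persists as
// the camera moves. Returns nil when tracking quality is too low to anchor.
func anchorDetection(at screenPoint: CGPoint, in view: ARSCNView) -> ARAnchor? {
    guard case .normal? = view.session.currentFrame?.camera.trackingState else {
        return nil   // degraded tracking: fall back to plain Vision boxes
    }
    guard let query = view.raycastQuery(from: screenPoint,
                                        allowing: .estimatedPlane,
                                        alignment: .any),
          let hit = view.session.raycast(query).first else {
        return nil
    }
    let anchor = ARAnchor(name: "obstacle", transform: hit.worldTransform)
    view.session.add(anchor: anchor)
    return anchor
}
```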

Voice grounded in HealthKit data feels different from generic encouragement. "You're in zone four going up this climb — back off cadence by five RPM and shift up" is a real instruction grounded in real signals. When the cell signal drops mid-activity, the voice degrades to a smaller on-device path that still has the biometric context. Coaching survives the connectivity loss.
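
A sketch of the grounding side, reduced to the simplest possible read: a one-shot query for the latest heart rate. The shipping system would stream heart rate, cadence, and altitude from a live workout builder, but the unit handling and query shape are the same.

```swift
import HealthKit

// Fetch the most recent heart-rate sample, in beats per minute.
func latestHeartRate(from store: HKHealthStore,
                     completion: @escaping (Double?) -> Void) {
    guard let heartRate = HKQuantityType.quantityType(forIdentifier: .heartRate) else {
        completion(nil)
        return
    }
    let newestFirst = NSSortDescriptor(key: HKSampleSortIdentifierStartDate, ascending: false)
    let query = HKSampleQuery(sampleType: heartRate,
                              predicate: nil,
                              limit: 1,
                              sortDescriptors: [newestFirst]) { _, samples, _ in
        let bpmUnit = HKUnit.count().unitDivided(by: .minute())
        let bpm = (samples?.first as? HKQuantitySample)?.quantity.doubleValue(for: bpmUnit)
        completion(bpm)   // feeds the coaching prompt's biometric context
    }
    store.execute(query)
}
```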

Local-first is non-negotiable for any product whose users move through space. Treating connectivity as the default and offline as the exception fails every time. Treating offline as the default and sync as the bonus is the design that survives a six-hour ride out of cell range.
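
A sketch of that offline-as-default data layer, with SwiftData as the local source of truth and CloudKit mirroring layered on top; the model and container identifier are placeholders. CloudKit-backed SwiftData stores require a default value on every attribute.

```swift
import Foundation
import SwiftData

@Model
final class Activity {
    // Defaults are required for CloudKit-backed SwiftData attributes.
    var startedAt: Date = Date.now
    var distanceMeters: Double = 0

    init(startedAt: Date, distanceMeters: Double) {
        self.startedAt = startedAt
        self.distanceMeters = distanceMeters
    }
}

func makeContainer() throws -> ModelContainer {
    let config = ModelConfiguration(
        "Activities",
        cloudKitDatabase: .private("iCloud.com.example.rides")   // placeholder container ID
    )
    // Writes always land in the local store first; CloudKit sync catches up
    // whenever a network connection and iCloud account are available.
    return try ModelContainer(for: Activity.self, configurations: config)
}
```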

What I'd tell someone about to build this
  • Custom on-device models beat generic cloud calls anywhere battery, latency, or connectivity matter.
  • AR world tracking turns vague detection into precise spatial cues. Worth the engineering.
  • Voice grounded in real signals is coaching. Ungrounded voice is noise.
  • Local-first is non-negotiable for products whose users move out of network range.

Want this for your product?

Let's talk about what you're trying to ship.

Book a call →