Local by default
Audio and speech recognition stay on the Mac. Cloud-based enhancement is optional, user-triggered, and separate from the recording path.
Case study · Personal project
Private meeting transcription that stays editable while the conversation is still happening.
A native macOS workflow for capturing microphone audio, the other side of a call, or both—then transcribing everything locally with Core ML.
Role
Product, design, and engineering
Platform
macOS 14.4+
Core stack
React, Tauri, Rust, Swift, Core ML
The problem
Most transcription tools ask for a quiet trade: send the conversation to a server, accept a rigid transcript, and clean it up later in a separate document.
Meeting Notes began with a stricter brief: capture both sides of a call, keep recognition on-device, and let the user keep writing in the same note while transcription continues around them.
That made privacy, editability, and native audio capture architectural constraints rather than features to bolt on at the end.
Product decisions
Audio and speech recognition stay on the Mac. Cloud-based enhancement is optional, user-triggered, and separate from the recording path.
The document remains editable during recording. Manual changes are preserved while new transcript segments continue after them.
Microphone and system audio can be recorded independently or together, with clear Me and Them labels in dual-source mode.
Architecture
The application uses a narrow event boundary between the web interface, the desktop shell, and native audio processing. That keeps the editor productive without asking the browser layer to pretend it is macOS.
Owns the note editor, recording controls, and visible recording state.
Coordinates audio capture, process lifecycle, and typed events between the interface and native sidecar.
Handles streaming speech recognition and the Core Audio process tap for system sound.
Runs Parakeet v3 locally for live captions, system audio, and file transcription across 25 languages.
What shipped
Live captions stream into the note with roughly three seconds of latency.
Dual-source mode captures both sides of a call without requiring screen-recording permission.
Every decoded window is retained instead of silently dropping transcript segments.
One local speech model supports live recording, system audio, and imported audio files.
The raw transcript remains the source of truth until the user explicitly chooses to enhance it.
Reflection
Keeping speech recognition local did more than protect audio. It created a clearer product contract: recording produces a useful raw note, editing never stops the stream, and AI enhancement only happens after an explicit choice.