Drop-in voice input for web forms. Speak naturally. Fill forms intelligently.
voice-form transforms the way users interact with forms. Instead of typing, clicking, and tabbing through fields, users speak — and your forms fill themselves. Built on the Web Speech API and LLM-powered parsing, it's framework-agnostic, requires zero API keys in the browser, and everything happens in under 30 milliseconds.
npm install @voiceform/core
# or
pnpm add @voiceform/coreMinimal example — contact form with voice input in 15 lines:
import { createVoiceForm } from '@voiceform/core'
const voiceForm = createVoiceForm({
endpoint: '/api/voice-parse', // Your backend endpoint
schema: {
formName: 'Contact Form',
fields: [
{ name: 'fullName', label: 'Full Name', type: 'text', required: true },
{ name: 'email', label: 'Email', type: 'email', required: true },
{ name: 'message', label: 'Message', type: 'textarea' },
],
},
events: {
onDone: (result) => {
if (result.success) {
document.querySelector('form')?.submit()
}
},
},
})The mic button appears automatically. User taps it, speaks "John Smith, john at example dot com, tell them I'm interested", and the form fields fill before they even finish speaking. One confirmation tap and you're done.
1. Tap mic → 2. Speak naturally → 3. Confirm values → 4. Fields fill
- Capture: Web Speech API captures audio from the user's microphone
- Transcribe: Raw speech is converted to text — via Web Speech or your own STT backend
- Parse: You send the transcript to your backend endpoint, which calls an LLM to extract structured field values (see OpenAI Structured Outputs or Anthropic Tool Use)
- Confirm: Users verify what was heard before the form fills — no surprises
- Inject: Form fields update with sanitized values; synthetic
inputandchangeevents fire so frameworks stay in sync
voice-form never touches your API keys. You own the endpoint, the LLM calls, and all the secrets. The browser sends only the transcript and form schema — both visible on the Network tab and intentionally public-safe.
// Your backend (Node, Python, Go, whatever)
// POST /api/voice-parse
import { buildSystemPrompt, buildUserPrompt } from '@voiceform/server-utils'
export async function parseVoice(req) {
const { transcript, schema } = req.body
const response = await openai.chat.completions.create({
model: 'gpt-4o-mini',
messages: [
{ role: 'system', content: buildSystemPrompt(schema) },
{ role: 'user', content: buildUserPrompt(transcript) },
],
})
return {
fields: {
fullName: { value: 'John Smith' },
email: { value: 'john@example.com' },
},
}
}Full reference endpoints for SvelteKit, Next.js, and Express in examples/.
Everything voice-form does takes under 30ms. That's one animation frame at 60fps. The LLM inference time (200-500ms), network latency, and user reading time are all on your backend — not us. Check our bundle size on Bundlephobia.
| Package | Size (gzip) |
|---|---|
@voiceform/core (headless) |
~7.2 KB |
@voiceform/core/ui |
~4.8 KB |
@voiceform/svelte |
~171 B |
First-class Svelte 5 support:
pnpm add @voiceform/svelte<script>
import { VoiceForm } from '@voiceform/svelte'
</script>
<VoiceForm
endpoint="/api/voice-parse"
schema={mySchema}
/>See the API Reference for the full component API, snippets, and store integration.
First-class React 18+ support:
pnpm add @voiceform/react
# peer deps: react >=18, react-dom >=18, @voiceform/core >=2.0.0Hook API:
import { useVoiceForm } from '@voiceform/react'
function ContactForm() {
const { state, instance } = useVoiceForm({
endpoint: '/api/voice-parse',
schema: {
formName: 'Contact Form',
fields: [
{ name: 'fullName', label: 'Full Name', type: 'text', required: true },
{ name: 'email', label: 'Email', type: 'email', required: true },
],
},
})
return (
<button
onClick={() => instance.start()}
disabled={state.status !== 'idle'}
>
{state.status === 'recording' ? 'Listening…' : 'Speak to fill form'}
</button>
)
}Component API (with default UI):
import { VoiceForm } from '@voiceform/react'
<VoiceForm
endpoint="/api/voice-parse"
schema={mySchema}
onDone={(result) => console.log('Filled:', result)}
/>With React Hook Form (onFieldsResolved bypasses DOM injection):
<VoiceForm
endpoint="/api/voice-parse"
schema={mySchema}
onFieldsResolved={(fields) => {
Object.entries(fields).forEach(([name, value]) => setValue(name, value))
}}
/>See the API Reference for the full hook and component API.
- React integration —
@voiceform/reactwithuseVoiceFormhook andVoiceFormcomponent - Whisper STT adapter —
MediaRecorder-based adapter via@voiceform/core/adapters/whisper - Append mode — Speak to append to existing field values (
appendMode: true) - Multi-step forms — Inject across wizard steps without errors on missing DOM fields (
multiStep: true) - Field corrections — Users can edit parsed values in the confirmation panel (
correctField()) - Auto-detect schema — Scan a form element to infer schema from DOM (
autoDetectSchema: true) - Developer tooling —
@voiceform/devwith schema inspector, logging middleware, and state visualizer
- No API keys in the browser. Ever. BYOE is not an option — it's the only path.
- Output sanitization. All LLM values are sanitized before DOM injection. XSS impossible from parsed fields.
- CSRF protection. Requests include the
X-VoiceForm-Requestheader. Your endpoint validates it. - Prompt injection defense. The transcript is passed to the LLM as JSON-escaped data in a separate
usermessage, not string-interpolated into instructions. See OWASP LLM01: Prompt Injection and the Prompt Injection Prevention Cheat Sheet. - Transparent data flows. Web Speech API sends audio to Google. If that concerns you, bring your own STT adapter.
See SECURITY.md for the full threat model and the OWASP Top 10 for LLM Applications.
Voice input must be disclosed to users. The Web Speech API sends audio to Google's servers. voice-form provides built-in controls to show a privacy notice before requesting mic permission.
createVoiceForm({
privacyNotice: 'Voice input uses your browser\'s speech recognition, processed by Google.',
requirePrivacyAcknowledgement: true,
// ...
})Read PRIVACY.md for data flows, GDPR considerations, and developer responsibilities.
voice-form follows WAI-ARIA Authoring Practices for all interactive elements:
- Mic button:
role="button", keyboard-activatable (Enter/Space),aria-labelper state - Confirmation panel:
role="dialog", focus trap, Escape to cancel - Screen reader announcements via
aria-liveregions prefers-reduced-motionrespected for all animations
voice-form requires Web Speech API support:
- Chrome, Edge, Safari 14.1+ (full)
- Firefox 25+ (via flag in Nightly)
On unsupported browsers, the library gracefully disables the mic button and shows an "unavailable" message.
The first time a user taps the mic button, their browser shows a microphone permission prompt. This is one-time only — the permission is remembered per origin.
- API Reference — Complete config, methods, types, and error codes
- Demo Site — Working contact form with voice input
- Security Guide — Threat model, developer checklist, mitigation strategies
- Privacy Guide — Data flows, compliance, user disclosure
- Architecture — High-level design, state machine, data flow
- Examples — SvelteKit, Next.js, Express endpoint implementations
- form2agent-ai-react — Voice-assisted AI form filling with React and OpenAI
- whisper-anywhere — Chrome extension for voice input using Whisper
- Vocode — Open-source library for voice-based LLM applications
- Ultravox — Multimodal LLM with native speech understanding
Issues and PRs welcome. We use Changesets for versioning. Check CONTRIBUTING.md for the full workflow.