Speaker Enrollment
Speaker identification lets Kenzy know who is talking. This enables personalized responses and is required for sensitive operations like locking and unlocking doors.
How it works
During enrollment, Kenzy records several short audio samples from a person and computes a speaker embedding — a compact numerical representation of their voice. At runtime, each captured utterance is compared against all enrolled embeddings using cosine similarity. The speaker with the highest similarity above the configured threshold is returned; otherwise the speaker is reported as unknown.
Embeddings are stored as .npy files in data/speakers/<name>.npy.
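The matching step described above can be sketched in a few lines. This is an illustrative sketch only, not Kenzy's actual code: the function names `cosine_similarity` and `identify` are hypothetical, embeddings are shown as plain Python lists rather than arrays loaded from the .npy files, and treating a score exactly equal to the threshold as unknown is an assumption.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector matches nothing
    return dot / (norm_a * norm_b)


def identify(utterance, enrolled, threshold=0.25):
    """Return the enrolled name with the highest similarity above the
    threshold, or "unknown" if no enrolled speaker clears it."""
    best_name, best_score = "unknown", threshold
    for name, embedding in enrolled.items():
        score = cosine_similarity(utterance, embedding)
        if score > best_score:  # assumption: strictly above the threshold
            best_name, best_score = name, score
    return best_name


# Toy three-dimensional embeddings; real speaker embeddings are much longer.
enrolled = {"alice": [1.0, 0.0, 0.0], "bob": [0.0, 1.0, 0.0]}
print(identify([0.9, 0.1, 0.0], enrolled))  # closest to alice
print(identify([0.0, 0.0, 1.0], enrolled))  # similarity 0 to both: unknown
```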
Requirements
- The kenzy-speaker service must be running
- The kenzy-tts service must be running (used to read prompts aloud during enrollment)
- A microphone connected to the machine running kenzy-enroll
Running enrollment
kenzy-enroll [configs/speaker.yaml]
The CLI will:
- Ask for the speaker's name
- Read each enrollment prompt aloud via TTS
- Record the speaker saying the prompt
- Repeat for all prompts
- Compute and save the embedding to data/speakers/<name>.npy
The default prompts are phonetically diverse sentences chosen to capture a broad range of sounds. You can customize them in configs/speaker.yaml under enroll_prompts.
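This page does not say how the recordings from the individual prompts are turned into a single embedding. One common approach, assumed here purely for illustration, is to average the per-sample embeddings and scale the result to unit length. The function name `combine_samples` is hypothetical, and the step that turns audio into a per-sample embedding is left out.

```python
import math


def combine_samples(sample_embeddings):
    """Average per-sample embeddings and scale the mean to unit length.

    Assumption: this is one plausible way to merge samples, not
    necessarily what kenzy-enroll does.
    """
    if not sample_embeddings:
        raise ValueError("at least one sample embedding is required")
    dim = len(sample_embeddings[0])
    if any(len(e) != dim for e in sample_embeddings):
        raise ValueError("all sample embeddings must have the same length")
    count = len(sample_embeddings)
    mean = [sum(e[i] for e in sample_embeddings) / count for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    if norm == 0.0:
        return mean  # nothing to normalize
    return [x / norm for x in mean]


# Two toy per-prompt embeddings merged into one enrollment embedding.
print(combine_samples([[1.0, 0.0], [0.0, 1.0]]))
```

Averaging over several phonetically diverse prompts is meant to make the stored embedding less sensitive to any single recording.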
Enrolling by voice (from a node)
You can also enroll without the CLI by speaking to a room node — say something like "Hey Kenzy, enroll me as Alice". Kenzy then reads out the enroll_prompts sentences (the same configurable list the CLI uses) and records your reply to each through that node's microphone, so the samples come from the device and room you actually use.
This is off by default. Enable it from the dashboard (Services → speaker →
toggle allow_voice_enroll on, Save) or by setting it in the speaker config:
allow_voice_enroll: true
The server reads this live, so a dashboard toggle takes effect without a restart.
How it flows once enabled:
- You ask Kenzy to enroll a name; it replies "Okay, enrolling Alice…".
- After each prompt tone, read back the sentence Kenzy asks for (one sample per enroll_prompts entry); each recording is POSTed to kenzy-speaker.
- It confirms "All done — I've enrolled Alice." (or cancels after several unclear captures, or times out if abandoned).
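The control flow of a voice-enrollment session can be sketched as follows. Everything here is hypothetical: the function `run_voice_enrollment`, the `capture` callback (which stands in for prompting and recording and returns None for an unclear capture), and the `max_unclear` limit of 3 are illustrative choices, since this page only says "several". The abandonment timeout and the request to kenzy-speaker are omitted.

```python
def run_voice_enrollment(prompts, capture, allow_voice_enroll, max_unclear=3):
    """Collect one sample per prompt.

    Returns the list of samples, or None if voice enrollment is disabled
    or too many captures were unclear.
    """
    if not allow_voice_enroll:
        return None  # feature is off by default
    samples = []
    unclear = 0
    for prompt in prompts:
        while True:
            sample = capture(prompt)  # hypothetical: None means unclear
            if sample is not None:
                samples.append(sample)
                break
            unclear += 1
            if unclear >= max_unclear:  # assumption: counted across the session
                return None  # cancel the enrollment
    return samples


# Simulated captures: the first attempt is unclear, the next two succeed.
answers = iter([None, "sample-1", "sample-2"])
print(run_voice_enrollment(["prompt 1", "prompt 2"], lambda p: next(answers), True))
```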
Why it's off by default
When voice enrollment is on, anyone within earshot of a node can enroll — including under an existing name. Because speaker identity gates sensitive actions (e.g. unlocking doors), that could let someone register their voice as a trusted person and bypass the gate. Leave it off unless you trust everyone with microphone access, and prefer the kenzy-enroll CLI for the people who can unlock things. Speaker ID is a convenience gate, not strong authentication (see Security implications).
Re-enrolling a speaker
Run kenzy-enroll again with the same name. The existing embedding file is overwritten.
Removing a speaker
Delete the embedding file:
rm data/speakers/<name>.npy
Restart kenzy-speaker for the change to take effect.
Tuning the identification threshold
The identify_threshold in configs/speaker.yaml controls how strict the match must be:
| Threshold | Behavior |
|---|---|
| 0.20 | Permissive: fewer unknown results, higher risk of misidentification |
| 0.25 | Default: good balance for a home environment |
| 0.30–0.35 | Strict: more unknown results if audio quality varies |
If enrolled speakers are frequently returned as unknown, lower the threshold. If strangers are being matched to enrolled speakers, raise it.
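To see the trade-off concretely, the sketch below counts both kinds of error at several thresholds. The scores are made up for illustration and the helper `evaluate_threshold` is hypothetical; in practice you would gather similarity scores from your own household and room. It assumes a match requires a score strictly above the threshold.

```python
def evaluate_threshold(threshold, genuine_scores, impostor_scores):
    """Return (enrolled speakers reported as unknown, strangers matched)."""
    false_unknown = sum(1 for s in genuine_scores if s <= threshold)
    false_match = sum(1 for s in impostor_scores if s > threshold)
    return false_unknown, false_match


# Made-up scores: enrolled speakers against their own embedding, and
# strangers against their closest enrolled embedding.
genuine = [0.22, 0.31, 0.40, 0.27, 0.36]
impostor = [0.05, 0.12, 0.21, 0.26, 0.09]

for t in (0.20, 0.25, 0.30, 0.35):
    print(t, evaluate_threshold(t, genuine, impostor))
# 0.2 (0, 2)
# 0.25 (1, 1)
# 0.3 (2, 0)
# 0.35 (3, 0)
```

With these sample numbers, lowering the threshold removes unknown results for enrolled speakers but lets two strangers match, while raising it does the reverse.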
Enrollment quality
Record in the room and with the microphone you will use day-to-day. Enrollment done in a quiet studio with a headset will not generalize well to a noisy kitchen with a far-field mic.
Security implications
Speaker identification is not a strong authentication mechanism — it can be fooled by a recording or a similar-sounding voice. It is used as a convenience gate (requiring a recognisable voice for lock/cover operations) rather than a cryptographic security boundary.