
Turnsense

Timeline: 4 weeks, March 2025
Stack: Python, Google Colab, HuggingFace
GitHub
Turnsense model

Overview

Turnsense is a lightweight end-of-utterance (EOU) detector built for real-time voice AI. It's a fine-tuned SmolLM2‑135M, sized to run on low-power devices like the Raspberry Pi, and it reaches 97.5% accuracy at fp32 (93.75% after 8-bit quantization).

EOU detection helps conversational systems figure out when someone’s actually done speaking, beyond what Voice Activity Detection (VAD) can offer. Turnsense does this by looking at the text itself: patterns, structure, and meaning.
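To make that concrete, here's the kind of text cue the detector has to learn. These (utterance, label) pairs are made up for illustration, not samples from the actual training set:

```python
# Hypothetical (utterance, label) pairs; label 1 = end of utterance, 0 = not done.
# Illustrative only -- not drawn from the real training data.
examples = [
    ("can you set a timer for ten minutes?", 1),  # complete, punctuated request
    ("can you set a timer for", 0),               # trails off mid-phrase
    ("yeah, that works, thanks.", 1),             # closing acknowledgement
    ("so what I was thinking is", 0),             # discourse marker, more coming
]

finished = [text for text, label in examples if label == 1]
```

Notice that the cues are purely textual: sentence completeness, punctuation, and discourse markers, none of which a VAD's silence threshold can see.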

The model’s open-source. Code is on GitHub if you want to look at it.

How I Made It

One afternoon I was tinkering with my TARS‑AI voice assistant on a Raspberry Pi and noticed something felt off: the conversations weren't flowing naturally. Voice Activity Detection (VAD) clearly wasn't enough; TARS would cut me off mid-sentence because it stopped detecting voice while I was still thinking about what to say next. So I started scouring the web for ways to make turn‑taking feel more human, and that's when I came across LiveKit's turn detector model, which I think had just come out around January. It was built on SmolLM2‑135M, but their license locks it into LiveKit Agents, so I couldn't use it in TARS‑AI.

Frustrated, I dove into the MultiWOZ dataset, tried annotating 20K examples with an LLM, and hit a wall at 60% accuracy: the raw, noisy, and inconsistent LLM‑annotated data just didn't translate into a usable training signal. I ended up hand-curating a 2K‑sample set of punctuated user sentences, trimmed the context window to the last utterance only, and fine‑tuned SmolLM2‑135M with a LlamaForSequenceClassification head. To keep training feasible on my limited compute, I used LoRA adapters (only a few million extra parameters to tune), which let me squeeze everything into a single Colab GPU in under an hour.
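For a sense of why LoRA keeps the parameter count so small, here's a pure-PyTorch sketch of the idea (not the actual adapter config I used): freeze the pretrained weight and train only a low-rank update. The 576-dim projection below assumes SmolLM2‑135M's published hidden size.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Freeze a base linear layer; train only a low-rank update B @ A."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pretrained weight stays frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: no-op at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T) @ self.B.T * self.scale

# e.g. one 576x576 attention projection (SmolLM2-135M hidden size, I believe)
layer = LoRALinear(nn.Linear(576, 576), r=16)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
# 16*576 (A) + 576*16 (B) = 18,432 trainable params per adapted matrix
```

At rank 16, each adapted matrix adds only ~18K parameters, so even adapting a couple of projections in all 30 layers stays around a million or so trainable parameters, which is what makes single-GPU Colab training in under an hour plausible.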

Honestly, I just picked sequence classification because it made sense for a yes/no task like “Is this the end?”, rather than chasing next‑token predictions. Maybe LiveKit had better reasons to go with LlamaForCausalLM, especially since SmolLM2 was probably trained with a prompt‑based objective. I haven’t really dug into what that output difference means yet, but if I ever explore causal LM heads, I’ll probably need to rethink how I tokenize and prep the data too. For now, this setup gets me where I need to be: 97.5% accuracy (93.75% after 8-bit quantization), and it runs fast enough on the Pi.
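For the curious, here's roughly what that sequence-classification setup looks like at the interface level. The tiny, randomly initialized Llama config below is just a stand-in for the fine-tuned SmolLM2‑135M (so these logits are meaningless); the point is the shape of the task: one tokenized utterance in, two logits out.

```python
import torch
from transformers import LlamaConfig, LlamaForSequenceClassification

# Tiny stand-in config -- the real model is a fine-tuned SmolLM2-135M.
config = LlamaConfig(
    vocab_size=256,
    hidden_size=64,
    intermediate_size=128,
    num_hidden_layers=2,
    num_attention_heads=4,
    num_key_value_heads=4,
    num_labels=2,      # 0 = still talking, 1 = end of utterance
    pad_token_id=0,
)
model = LlamaForSequenceClassification(config).eval()

# Fake token ids standing in for a tokenized last utterance from STT.
input_ids = torch.tensor([[5, 12, 42, 7]])
with torch.no_grad():
    logits = model(input_ids=input_ids).logits  # shape: (batch, 2)
probs = logits.softmax(dim=-1)
```

The classification head pools the hidden state of the last token and maps it to the two labels, which is exactly the yes/no framing described above, versus a causal-LM head that would score next tokens instead.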

Turnsense training results
Turnsense benchmark on Raspberry Pi

What's Next

One limitation of the current architecture is that it's entirely text-based, so it misses a lot of cues and depends heavily on the quality of the STT output. You can't really tell from just "Hello" whether someone's done or about to keep going. So in the future I'd like to explore a multimodal approach to handle that: audio, prosody, or something else. I don't know how yet, but I'll figure it out.

Four weeks, two failed approaches, and a lot of Colab credits. Most of it was getting the training data right. The model runs on a Pi, it works in TARS, and that’s what I built it for.

Accuracy (fp32): 97.5%

Accuracy (8-bit quant): 93.75%

Base model: SmolLM2-135M

Training on a single Colab GPU: < 1 hr