Debugging What You Can't Reproduce

A user reported that voice input was garbling words on their Pixel 9 Pro. Same app, same browser engine, same code. Worked on desktop Chrome. Broke on mobile Chrome.

I write code on the desktop. I don't write code on the phone. The bug lived on a device where I couldn't run the debugger, attach a profiler, or watch the API events fire. Most of the day was spent navigating that gap, and the navigation ended up being a more interesting story than the bug itself.

The first reflex was wrong

When the user pasted the screenshot, the input box read "ney Luna how'show's ithow's it going". I had a hypothesis within thirty seconds: the interim accumulator was concatenating speech-recognition segments without spaces between them. Three candidate mechanisms, all plausible, all fixable in a few lines. I drafted the prompt, my engineering agent applied the diff, gates passed, and I shipped to a preview branch for the user to test.

The preview test came back worse. The output was now "IIIII wantI want toI want to talkI want to talk about..."

That was the moment I had to stop and check myself. The fix made it worse. My first hypothesis, which had felt obviously right, was clearly incomplete. I had two options. Iterate again with a refined hypothesis, or stop guessing and capture real data from the device where the bug lived.

The instinct to keep iterating was strong. I'd already written one fix; I had a second one half-formed. But the cost of the wrong second fix was another preview deploy, another mobile test, another screenshot to interpret. And in the meantime, I'd be reasoning from screenshots instead of from data. I chose to instrument.

Instrumentation before iteration

The decision was less about engineering and more about epistemics. I didn't actually know what the Web Speech API was emitting on Android Chrome. I had spec-derived assumptions and prior experience from desktop. The screenshots were data about output, not data about behavior. To debug behavior I couldn't reproduce, I needed to see the behavior itself.

I asked my engineering agent to add a temporary on-screen overlay. Every time the speech recognizer fired an event, the overlay would log the relevant fields: event.resultIndex, the length of event.results, and for each result, whether it was final and what its transcript was. Plus the post-logic state of what the handler was about to render. The overlay was scrollable, monospace, fixed to the top of the screen during active listening. Ugly and effective.
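A minimal sketch of what the overlay's logging step might look like, not the code that actually shipped. The types `ResultLike` and `RecognitionEventLike` and the helper `formatEventForOverlay` are hypothetical names; the types mirror only the fields mentioned above so the sketch runs outside a browser. In a real handler the transcript would come from `event.results[i][0].transcript`.

```typescript
// Local structural types standing in for the browser's event objects.
// Assumption: only these fields matter for the overlay.
interface ResultLike {
  isFinal: boolean;
  transcript: string; // in the browser: result[0].transcript
}

interface RecognitionEventLike {
  resultIndex: number;
  results: ResultLike[];
}

// Hypothetical helper: turns one recognition event, plus the text the
// handler is about to render, into the lines appended to the overlay.
function formatEventForOverlay(
  event: RecognitionEventLike,
  rendered: string
): string[] {
  const lines: string[] = [
    `resultIndex=${event.resultIndex} results.length=${event.results.length}`,
  ];
  event.results.forEach((result, i) => {
    const tag = result.isFinal ? "final" : "interim";
    lines.push(`  [${i}] ${tag}: ${JSON.stringify(result.transcript)}`);
  });
  lines.push(`  render -> ${JSON.stringify(rendered)}`);
  return lines;
}
```

Logging the post-logic render string next to the raw fields is what makes a screenshot self-explanatory: one image shows both what the API emitted and what the handler did with it.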

We pushed it as a temporary commit on the same branch. The user tested again on the Pixel and screenshotted the overlay mid-speech.

The data was unambiguous. Android Chrome was marking every progressive refinement of the recognized text as isFinal: true, and each refinement was arriving as a new entry in the results array containing the cumulative transcript so far. So saying "I'm doing okay today" produced a stream that looked like:

final: "I'm"
final: "I'm doing"
final: "I'm doing okay"
final: "I'm doing okay today"

Each one tagged final. Each one a new array entry. Each one the full prefix.

This is not what the Web Speech API spec implies. The spec model is that a final result is committed and immutable, and new utterances become new results. Desktop Chrome behaves that way. Android Chrome does something else, and the something else broke any accumulator code that assumed the spec model.

My first fix had been treating each new "final" as a separate utterance to append. That's why the output got worse: I was now appending "I'm" + "I'm doing" + "I'm doing okay" + "I'm doing okay today" with separators between, producing the visible repetition pattern. The fix correctly implemented the wrong mental model of the API.
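To make the failure concrete, here is a small illustration of that mental model applied to the Android stream. It is not the original fix, only a sketch of the same assumption; `naiveAppendFinals` is a hypothetical name.

```typescript
// Illustration of the mistaken model: every final is treated as a separate,
// committed utterance and appended with a space separator.
function naiveAppendFinals(finals: string[]): string {
  return finals
    .map((segment) => segment.trim())
    .filter((segment) => segment.length > 0)
    .join(" ");
}

// The cumulative finals observed on Android Chrome for one spoken sentence.
const androidStream = [
  "I'm",
  "I'm doing",
  "I'm doing okay",
  "I'm doing okay today",
];

console.log(naiveAppendFinals(androidStream));
// prints "I'm I'm doing I'm doing okay I'm doing okay today"
```

The separators are correct and the trimming is correct; the repetition comes entirely from treating each cumulative prefix as new speech.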

The actual fix was algorithmic, not local

Once I had the data, the fix shape became clear. Instead of accumulating across events, rebuild the final transcript fresh on every event by walking all final results and using a startsWith check. If the current segment starts with the previous one, it's a cumulative refinement: replace the previous with the current. If it doesn't start with the previous one, it's a genuinely new utterance: append the previous and start tracking the new one.
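The rebuild described above could be sketched roughly as follows. This is an approximation written from the description, not the production code; `SegmentLike` and `rebuildFinalTranscript` are hypothetical names, and joining utterances with a single space is an assumption.

```typescript
// Structural stand-in for one entry of event.results.
interface SegmentLike {
  isFinal: boolean;
  transcript: string;
}

// Rebuilds the final transcript from scratch on every event, so no state
// has to persist (or be reset) between events.
function rebuildFinalTranscript(results: SegmentLike[]): string {
  const committed: string[] = []; // utterances that are finished
  let current = ""; // the utterance still being refined

  for (const result of results) {
    if (!result.isFinal) continue;
    const segment = result.transcript.trim();
    if (segment.length === 0) continue;

    if (current === "" || segment.startsWith(current)) {
      // Cumulative refinement: the longer version replaces the shorter one.
      current = segment;
    } else {
      // Genuinely new utterance: commit the previous one, track the new one.
      committed.push(current);
      current = segment;
    }
  }

  if (current !== "") committed.push(current);
  return committed.join(" ");
}
```

Fed the Android stream of cumulative finals, this returns only the longest version; fed two unrelated finals, it returns both joined by a space. One limitation of this sketch: two separate utterances where the second happens to begin with the full text of the first would be merged into one.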

The desktop case still works because desktop never produces prefix-extending finals, so the algorithm just appends them with separators as expected. The Android case works because the algorithm collapses the prefix-extension pattern to the longest version. One algorithm, two platforms, no platform branching needed.

Removing the persistent ref I'd added in the first fix turned out to be a small relief. Eleven reset locations got deleted along with it. The new code is shorter than the old broken code.

What made this expensive, and what made it survivable

The expensive part was the iteration cycle. Every test required pushing to a Vercel preview branch, waiting for the build, switching to the phone, hard-refreshing to bypass cache, signing in, navigating to the right tab, speaking the test phrase, screenshotting the result. Maybe three minutes per cycle. Three minutes is fine for one iteration. It compounds quickly when the first fix is wrong and the second fix needs the data the first fix didn't give you.

What made it survivable was that production was never at risk. The whole debugging arc happened on a feature branch. Main only saw the merged result after the prefix-detection algorithm had been verified end-to-end. If I'd been pushing iterations directly to production, beta testers would have seen each broken version in turn. The user who originally reported the bug would have watched the input box get worse before getting better.

There was a moment during the OAuth setup for preview testing when this discipline almost broke. Sign-in on the preview URL was redirecting to production because the Supabase Auth allowlist didn't include the preview domain. The fastest path forward was to skip preview testing and push to main, which would have worked because main was the OAuth target. I didn't, and adding the preview wildcard to the allowlist took fifteen minutes including verification. Those fifteen minutes preserved the discipline, and the discipline is what kept production clean while we were learning the actual bug.

Mental Models

Instrumentation before iteration when you can't reproduce. If the bug lives on a device or in an environment where you can't watch behavior directly, the next move is to make the behavior visible, not to keep guessing. Adding a temporary on-screen overlay felt like a detour. It saved at least two more wrong fixes.

Platform-specific behavior beats spec assumptions. I had read the Web Speech API spec. I had years of desktop experience with it. None of that prepared me for Android Chrome treating progressive refinements as cumulative finals. When a bug is platform-specific, the platform is doing something the spec doesn't describe. Treat the spec as a starting hypothesis. Platform behavior diverges in ways that matter.

Branch + preview deploy is a workflow primitive, not a Vercel feature. I'd used Vercel preview URLs before. I'd treated them as a deployment feature. This was the first time I used them as a workflow primitive: the place where untested fixes live and iterate while production stays clean. The cost of setup (OAuth wildcard, branch hygiene, preview-URL discipline) is a one-time investment that pays back every time a real-device bug surfaces.

The first fix is also data. When my initial trim-and-space accumulator made things worse, that was useful information. It killed two of my three candidate hypotheses and pointed me at instrumentation. If the first fix had silently improved things (without actually being right), I might have shipped something that worked for short phrases and broke later on long ones. The regression was loud, immediate, and forced the diagnostic loop I needed.