The Silent Failure Mode: Verifying Telemetry Before It Ships

Today I shipped the first half of a new telemetry layer for Duskglow. The actual code, the schema, and the build configuration all landed cleanly. The decision worth writing about happened in the thirty seconds before any of it.

A solo founder building consumer software needs to see what's happening when users hit problems. One of our beta testers churned last week after losing two journal sessions to mid-conversation page refreshes on her Pixel 7. Server-side diagnostics turned up nothing useful. Supabase free-plan log retention is one day, the relevant sessions were eight days old, and the build had zero client-side error reporting. The whole reason I was building telemetry was to make sure that gap doesn't happen again.

The check that almost didn't happen

I had a Tier B Claude Code session staged and ready. The schema had been deployed earlier in the same chat with no errors. Going straight to wiring would have been the obvious move. The table existed, the SQL had returned "Success, no rows returned," and the code in my emit module declared the right TypeScript shapes for the data going in.

What stopped me was a standing rule in our Technical Decision Loop. Every load-bearing decision gets pre-applied checks, and one of them is reversibility. Schema typing on a table that's already deployed is reversible while empty. Once events start flowing, an ALTER COLUMN on millions of rows becomes a maintenance event with downtime risk. Cheap now, expensive later.

So I asked Claude in chat to read the deployed DDL and tell me each column's actual type. Thirty seconds.

What the schema actually said

device_memory_gb integer null. viewport_width integer null. viewport_height integer null. screen_width integer null. screen_height integer null.

Five columns, all integer-typed.

The problem with device_memory_gb was specific. Chrome's navigator.deviceMemory API reports approximate RAM in gigabytes, and it returns fractional values for low-RAM devices. A 256MB phone reports 0.25. A 512MB phone reports 0.5. Any tablet or budget Android with sub-1GB RAM reports a fraction. Postgres rejects an INSERT of 0.5 into an integer column with a type-mismatch error. The four viewport and screen columns had the same problem. High-DPR mobile devices and zoomed desktops can return fractional CSS pixel counts, and modern Chrome and Safari mobile both do this on certain device-pixel-ratio combinations.
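The same check can be made mechanical. What follows is a minimal sketch, not the code from this session: deployedColumns is a hypothetical hand-maintained mirror of the DDL, findTypeMismatches is a hypothetical helper, and the sample pixel values are invented for illustration. It flags any field whose value would not fit the deployed column type.

```typescript
// Hypothetical mirror of the deployed DDL: column name -> Postgres type.
type PgType = "integer" | "numeric";

const deployedColumns: Record<string, PgType> = {
  device_memory_gb: "integer",
  viewport_width: "integer",
  viewport_height: "integer",
  screen_width: "integer",
  screen_height: "integer",
};

// Returns the names of fields whose value would not fit the deployed type.
function findTypeMismatches(
  row: Record<string, number | null>,
  columns: Record<string, PgType>,
): string[] {
  const mismatches: string[] = [];
  for (const [name, value] of Object.entries(row)) {
    const pgType = columns[name];
    // Unknown columns and NULLs are out of scope for this sketch.
    if (pgType === undefined || value === null) continue;
    if (pgType === "integer" && !Number.isInteger(value)) {
      mismatches.push(name);
    }
  }
  return mismatches;
}

// Invented sample shaped like what a 512MB device might send.
const lowRamSample = {
  device_memory_gb: 0.5,
  viewport_width: 412,
  viewport_height: 915,
  screen_width: 412,
  screen_height: 915,
};

const mismatches = findTypeMismatches(lowRamSample, deployedColumns);
console.log(mismatches); // ["device_memory_gb"]
```

Run against a handful of realistic samples before wiring, a helper like this turns the thirty-second manual read of the DDL into something repeatable.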

The application code wraps every emit in try/catch with a static console.warn message and no error rethrow. That part was deliberate. Telemetry must never crash the app. The consequence in this case was that every failed insert would have been silently swallowed. No row written, no exception surfaced, no log entry, no way to know.
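To make the swallow concrete, here is a small sketch of the pattern described above. It is an illustration under stated assumptions: insertRow is a hypothetical stand-in for the database write, kept synchronous for brevity, and it simply refuses non-integers the way the integer column would. The real emit path is not shown in this post.

```typescript
type DeviceRow = { device_memory_gb: number };

// Rows that actually reached the pretend table.
const writtenRows: DeviceRow[] = [];

// Hypothetical stand-in for the database write. It rejects fractional
// values, as an integer column would.
function insertRow(row: DeviceRow): void {
  if (!Number.isInteger(row.device_memory_gb)) {
    throw new Error("value does not fit an integer column");
  }
  writtenRows.push(row);
}

// The pattern from the text: catch everything, warn with a static
// message, never rethrow. The caller cannot tell success from failure.
function emit(row: DeviceRow): void {
  try {
    insertRow(row);
  } catch {
    console.warn("telemetry emit failed");
  }
}

emit({ device_memory_gb: 8 }); // written
emit({ device_memory_gb: 0.5 }); // dropped, and emit() still returns normally

console.log(writtenRows.length); // 1
```

Both calls return normally. The only trace of the dropped row is a static warning in a console nobody is watching on a tester's phone.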

The failure mode I couldn't see

I was building telemetry to make a specific user segment's churn diagnosable. The tester was on a Pixel 7, an 8GB phone, which is not technically low-RAM in 2026 terms. But the browser tab discard hypothesis we'd been chasing implied memory pressure, the kind of state where the OS aggressively reclaims background tabs. The exact device behavior I most needed instrumentation for sits at the boundary between low-RAM and moderate-RAM, and the lower edge of that boundary is exactly where navigator.deviceMemory returns the fractional values that device_memory_gb could not store.

Without the schema fix, telemetry from that user population would have been zero rows. From the dashboard it would have looked like clean, error-free sessions. I'd have looked at the empty data and concluded the problem was elsewhere. The conclusion would have been wrong.

The PM lesson here is one I keep circling back to. Instrumentation that silently fails is worse than no instrumentation. No telemetry forces honesty about what you don't know. Silent telemetry produces false confidence. The user segment most likely to break your assumptions is also the segment most likely to be invisible in your data, and that pairing is exactly the one you can't afford to get wrong.

The async frame that would have leaked

A separate finding from the same session sat at a different layer of the same lesson.

Telemetry captures structured stack traces on uncaught errors and unhandled rejections. A sanitizer scrubs known-sensitive function names before the row is written, blocking three function names tied to encryption work plus any frame whose file path includes /crypto/ or /encryption/. The privacy stakes are real here. A stack trace from a decrypted-entry parse failure that named the function would leak architectural detail about our cryptographic pipeline.

The first version of the parser worked, mostly. I caught a bug in chat-side review before the code got wired into anything.

V8 renders async function frames as "at async decryptEntry (...)". The original regex captured the function name as "async decryptEntry". The deny-list lookup checked against "decryptEntry". The two strings never matched, so the deny-list missed every async-call-site crypto frame and let them through.

One line in the parser fixed it. But the principle generalizes. The privacy property depended on the deny-list, the deny-list depended on the parser, and the parser rested on a regex assumption nobody had verified. Three layers, one assumption, one privacy bug. Same shape as the schema typing finding. An assumption at one layer that, if wrong, invalidates the property the upper layer depends on.
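The async-frame mismatch is easy to reproduce. The sketch below is a reconstruction for illustration, not the parser from this session: both regexes, the file paths, and every name other than decryptEntry and the two path fragments are hypothetical. It only handles frames of the form "at name (location)".

```typescript
// Stand-ins for the real deny-list, which this post does not enumerate.
const DENIED_FUNCTIONS = new Set(["decryptEntry"]);
const DENIED_PATH_PARTS = ["/crypto/", "/encryption/"];

// A regex of the kind described: the "async " prefix ends up in the name.
const NAIVE_FRAME = /^\s*at (.+?) \((.+)\)$/;

// One-line fix: consume an optional "async " before capturing the name.
const FIXED_FRAME = /^\s*at (?:async )?(.+?) \((.+)\)$/;

// Returns the captured function name, or null if the line is not a
// "at name (location)" frame.
function functionNameOf(line: string, pattern: RegExp): string | null {
  const match = pattern.exec(line);
  return match === null ? null : match[1];
}

// A frame is sensitive if its path or its function name is denied.
function isSensitiveFrame(line: string): boolean {
  if (DENIED_PATH_PARTS.some((part) => line.includes(part))) return true;
  const name = functionNameOf(line, FIXED_FRAME);
  return name !== null && DENIED_FUNCTIONS.has(name);
}

// Hypothetical frame, deliberately outside /crypto/ so only the name can match.
const asyncLine = "    at async decryptEntry (/app/src/journal/load.ts:42:11)";

const naiveName = functionNameOf(asyncLine, NAIVE_FRAME);
const fixedName = functionNameOf(asyncLine, FIXED_FRAME);

console.log(naiveName); // "async decryptEntry", which the deny-list misses
console.log(fixedName); // "decryptEntry"
console.log(isSensitiveFrame(asyncLine)); // true
```

The useful habit is the fixture more than the regex: a real stack trace captured from an async call site, pasted into a test, verifies the assumption the parser rests on.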

Defense in depth, applied to instrumentation

The module written in this session has four enforcement layers. A type-level discriminated union constrains what payload shape each event type can carry. A runtime sanitizer strips sensitive content from message strings and stack traces. A sanctioned-helper-only write path is enforced by a CI gate, which lands in the next chat. A schema with append-only row-level security prevents UPDATE on any row.
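For readers unfamiliar with the first layer, this is roughly what a discriminated union buys. The event names and fields below are hypothetical, since the real event types are not listed in this post. The sketch shows only the mechanism: the type field selects which payload shape the compiler will accept.

```typescript
// Hypothetical event shapes, for illustration only.
type TelemetryEvent =
  | { type: "uncaught_error"; message: string; stack: string[] }
  | {
      type: "page_lifecycle";
      state: "visible" | "hidden" | "discarded";
      deviceMemoryGb: number | null;
    };

// Narrowing on `type` gives each branch access to exactly its own fields.
function describeEvent(event: TelemetryEvent): string {
  switch (event.type) {
    case "uncaught_error":
      return `error with ${event.stack.length} frames`;
    case "page_lifecycle":
      return `page ${event.state}`;
    default: {
      // Compile-time exhaustiveness check: adding a new event type
      // without handling it here becomes a type error.
      const unreachable: never = event;
      return unreachable;
    }
  }
}

const sample: TelemetryEvent = {
  type: "page_lifecycle",
  state: "hidden",
  deviceMemoryGb: 0.5,
};

console.log(describeEvent(sample)); // "page hidden"
```

The type says deviceMemoryGb is a number, and 0.5 satisfies it. Nothing at this layer knows the column underneath is an integer, which is why the type contract still has to be checked against the deployed schema.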

Four layers of enforcement; two assumption-bugs caught chat-side before they shipped. Both would have been silent failures in production. One would have invalidated diagnostic capability on the most-needed user segment. The other would have invalidated a privacy property on stack traces from cryptographic call sites.

Defense in depth works in security because no single layer is trusted. The same logic applies to instrumentation. No single layer's correctness can be assumed. The type contract has to be verified against the deployed schema. The regex assumption has to be verified against real V8 stack traces. The sanctioned-helper anchor has to be verified against barrel-refactor scenarios. Each one independently cheap; each one independently load-bearing.

Mental Models

Silent failure is the worst failure mode in instrumentation. No instrumentation forces honesty about what you don't know. Silent instrumentation produces false confidence. If you have to choose between no instrumentation and silently broken instrumentation, choose no instrumentation.

"Looks fine" is not verification. A passing test, a clean schema deploy, a green build, a sane-looking diff. Each is necessary; none is sufficient. The failure modes that matter most are the ones that look identical to success on the surface.

Reversibility check pairs with assumption check. "Cheap now, expensive later" is the most dangerous decision shape in software architecture. The thirty-second verification is almost always the right call when the alternative is a maintenance event on populated production data.

Defense in depth applies to telemetry, not just security. Type contracts, runtime sanitizers, sanctioned write paths, schema constraints. Each layer enforces what the next layer assumed. The bugs that matter live at the joins between layers, not inside any one of them.