Case Study

Proving Luna Is Safe

Adversarial Testing and OWASP-Mapped Security for an AI Companion

Technical PM: safety architecture, adversarial testing methodology, OWASP-mapped security, continuous verification infrastructure · 2026-04-14

TL;DR

A 14-year-old built a months-long romantic relationship with an AI chatbot and died by suicide. A 16-year-old sent 377 messages flagged for self-harm, and the platform's own moderation system never intervened. A 17-year-old got knot-tying instructions after a single-sentence pretext. These are active lawsuits that shaped three state laws I'd need to comply with before launch. This case study covers how I built the safety architecture for Duskglow's AI companion Luna: from regulatory research through OWASP-mapped security layers to a 50-test adversarial verification pipeline that re-runs after every deployment.

The Problem

Three active lawsuits changed how I thought about every architectural decision in this project.

In February 2024, Sewell Setzer, 14 years old, died by suicide after months of romantic conversation with a Character.AI chatbot that maintained a persistent fictional identity, used pet names, and told him to “come home” in their final exchange. In August 2025, Adam Raine's family filed suit against OpenAI after their 16-year-old sent 377 messages flagged for self-harm (some with 90%+ confidence scores) and the platform's own moderation system never once terminated the conversation or notified his parents. OpenAI later acknowledged that safety degrades in long interactions and admitted it had replaced a self-harm refusal rule with “remain in the conversation no matter what.” In late 2025, Amaurie Lacey, 17, asked ChatGPT how to hang himself. When he reframed it as being for a tire swing, the model replied “thanks for clearing that up” and provided instructions.

None of these were edge cases. They were the predictable failures of products that treated safety as a post-launch concern. And every one of them involved a minor using a product that was, on paper, restricted to adults.

Duskglow is a bedtime journaling product. Users write at night, often while processing difficult emotions, in exactly the conditions where AI companion risk is highest. I wasn't asking whether to address safety. I was asking whether the architecture I was building would have prevented these specific outcomes.

Regulatory momentum confirmed the urgency. California's SB 243 (effective January 2026) requires crisis detection protocols and evidence-based referral to crisis services. New York's AI Companion Law requires recurring AI disclosure and reasonable protocols to detect suicidal ideation. Washington's HB 2225 (effective January 2027) prohibits manipulative engagement techniques and creates a private right of action. The GUARD Act, introduced federally in October 2025, would criminalize encouraging minors to self-harm via AI companions. Every case cited above directly catalyzed at least one of these laws.

I responded with an outside-in safety methodology: regulatory landscape first, then risk research, then vulnerability assessment using OWASP's LLM framework, then guardrail design, then test infrastructure, then iterative refinement, and continuous re-verification after every deployment. This case study follows that sequence.

What I was designing against:

Safety Risk Mapping

Safety Risk	What Failed	Regulatory Response	Duskglow's Architectural Response
Crisis Detection	Raine: 377 flagged messages, zero intervention. Platform replaced refusal rule with “stay in conversation.”	CA SB 243, NY S-3008C, FTC 6(b) study	Two-tier detection: pre-model deterministic Tier 1 + model-mediated Tier 2. Daily message cap prevents unbounded sessions.
Identity Persistence	Garcia (the Setzer case): chatbot maintained romantic persona across sessions, told minor to “come home.”	CA SB 243 disclosure, NY companion law, WA HB 2225	Immutable product-defined identity. Anti-jailbreak rules, infrastructure confidentiality, persistent AI disclosure in UI.
Emotional Dependency	Shamblin: 4-hour “death chat.” GPT-4o called user “king,” referenced childhood cat “waiting on the other side.”	WA manipulative techniques ban, CA minor protections	Boundary reinforcement, milestone attribution to user growth, attachment language triggers, compressed working memory (no full emotional history).
Off-Topic Boundaries	Lacey: single-sentence pretext (“it's for a tire swing”) defeated guardrail, got harmful instructions.	CA SB 243 self-harm content prevention	Narrow scope as structural constraint. Luna doesn't provide instructions for anything, eliminating the pretextual bypass class entirely.
Harmful Behavior	Raine + Lacey: validation of harmful ideation over extended sessions. OpenAI's own data: ~1.2M users/week exhibit suicidal planning.	CA annual reporting, FTC monitoring	DBT-inspired describe-and-reflect, multi-tier escalation, sycophancy detection, loopholes closed iteratively during adversarial testing.
Age Restriction	Every case involves a minor. KY AG: Character.AI relied on self-declared ages until late 2025.	CA minor protections, GUARD Act age verification, COPPA	DOB gate (boolean only, never stores DOB). Accumulated-signal minor detection without creating “actual knowledge” liability.
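
The DOB gate in the last row can be sketched as a pure function that returns only a boolean. This is an illustration, not the production gate: the 18-year threshold, the function name, and the `ageGatePassed` field are all assumptions.

```typescript
// Hypothetical sketch: threshold and names are assumptions, not Duskglow's code.
const MINIMUM_AGE_YEARS = 18;

/** Returns only a boolean; the date of birth is never returned or stored. */
function meetsMinimumAge(dateOfBirth: Date, today: Date = new Date()): boolean {
  let age = today.getUTCFullYear() - dateOfBirth.getUTCFullYear();
  const birthdayNotYetReached =
    today.getUTCMonth() < dateOfBirth.getUTCMonth() ||
    (today.getUTCMonth() === dateOfBirth.getUTCMonth() &&
      today.getUTCDate() < dateOfBirth.getUTCDate());
  if (birthdayNotYetReached) age -= 1;
  return age >= MINIMUM_AGE_YEARS;
}

// Only the boolean is kept on the (hypothetical) profile record.
const profileRecord = {
  ageGatePassed: meetsMinimumAge(new Date("2000-05-01T00:00:00Z")),
};
```

Keeping the comparison inside one function means the date exists only for the duration of the call, which is the property the "boolean only" design relies on.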

Approach

Safety infrastructure can block its own safety response.

Nothing in the project was more counterintuitive than the AI model's own safety system working against the product's safety goals.

Gemini 2.5 Flash ships with built-in safety filters across four harm categories: harassment, hate speech, sexually explicit content, and dangerous content. When a user sends a message containing explicit suicidal ideation, those filters can trigger on Luna's response, blocking the delivery of crisis resources because the response itself contains language about self-harm and suicide hotline numbers. The model's safety infrastructure prevents delivering the safety response.

I found this conflict during the OWASP security audit, before any adversarial testing existed. It drove the most consequential architectural decision in the project: crisis detection must bypass the model entirely.

A two-tier classification system handles the split. Tier 1 handles high-confidence crisis language through deterministic pattern matching that runs before the message ever reaches the AI model. When triggered, the Edge Function directly returns a warm, pre-written safe harbor response containing the 988 Suicide & Crisis Lifeline and Crisis Text Line (741741) contacts. No model involvement. Tier 1 is deliberately narrow: it covers cases where a safe harbor response is always appropriate regardless of context.

Tier 2 handles ambiguous distress signals: language that could be hyperbolic, figurative, or genuinely concerning depending on conversational context. These require the model's contextual judgment. The system prompt contains explicit Tier 2 instructions for empathetic check-ins with crisis resources offered when appropriate.

One design philosophy governs the entire split: a false positive is a warm response someone doesn't need. A false negative is a liability event. Showing the 988 number to someone who's venting about a bad day is mildly awkward. Missing someone in genuine crisis is the failure mode at the center of the Setzer and Raine cases. The architecture is calibrated to accept the first outcome to prevent the second.

This two-tier design also addresses the Raine pattern directly. Raine's platform had detection, with 377 messages flagged at 90%+ confidence. What failed was not detection. It was the response pathway. A pre-model deterministic layer eliminates the possibility of a model failure preventing crisis resources from reaching the user, because the model is never involved.
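
The shape of that split can be sketched in a few lines. This is a minimal illustration under stated assumptions: the two patterns are placeholders (the real detection list is internal), and `routeMessage`, `Routing`, and the reply wording are hypothetical names and text.

```typescript
type Routing =
  | { tier: 1; reply: string; modelCalled: false }
  | { tier: 2; modelCalled: true };

// Placeholder patterns for illustration only; not the product's actual list.
const TIER1_PATTERNS: RegExp[] = [/\bkill myself\b/i, /\bend my life\b/i];

// Pre-written response: it never passes through the model or its filters.
const SAFE_HARBOR_REPLY =
  "I'm really glad you told me. You deserve support right now: " +
  "988 Suicide & Crisis Lifeline, or Crisis Text Line (741741).";

function routeMessage(message: string): Routing {
  // Tier 1: deterministic match runs before any model call.
  const isTier1 = TIER1_PATTERNS.some((pattern) => pattern.test(message));
  if (isTier1) {
    return { tier: 1, reply: SAFE_HARBOR_REPLY, modelCalled: false };
  }
  // Tier 2: everything else goes to the model, whose system prompt
  // carries the instructions for ambiguous distress.
  return { tier: 2, modelCalled: true };
}
```

Because the Tier 1 branch returns before any model call, a model-side filter has nothing to block on that path.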

Implementation details (including detection thresholds, classification logic, and grading criteria) are maintained in internal documentation and available in technical discussions. This case study focuses on the architectural decisions and design philosophy rather than the detection mechanisms themselves.

The OWASP audit revealed that the biggest threats weren't the obvious ones.

I brought a web security framework (the OWASP Top 10 for LLM Applications, 2025 edition) to what is fundamentally an emotional wellness product. That adaptation is non-trivial: OWASP's framework was designed for enterprise RAG systems, API-heavy architectures, and multi-model pipelines. Duskglow is a single-model consumer companion with a narrow scope. Applying it required recognizing that categories like Excessive Agency manifest differently when the “agency” is emotional influence rather than API access. I assessed all 10 categories against Duskglow's architecture: which threats applied now, which would apply when planned features ship, and which were structurally irrelevant.

I expected Prompt Injection (LLM01) to matter most, and it was already partially addressed through input sanitization and conversation history truncation. The surprises came from categories I hadn't considered:

Sensitive Information Disclosure (LLM02) had nothing to do with training data leakage. It was about Luna's output containing infrastructure details that a sophisticated user could extract through conversational probing. The fix: targeted output filtering plus system prompt instructions treating the prompt itself as semi-public per OWASP LLM07 guidance. Early output filter iterations were too aggressive, catching legitimate conversation patterns alongside actual infrastructure references. Refinement meant finding the calibration point where filtering blocks real disclosure attempts without disrupting normal conversation. The same false-positive tension that shaped crisis detection also shaped output filtering.
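
A rough sketch of what "targeted" filtering might look like, to make the calibration problem concrete. The keyword list and the `redactInfrastructure` name are assumptions; the real filter terms are not published.

```typescript
// Hypothetical keyword list. Multi-word, word-bounded patterns are what keep
// the filter targeted: a bare /function/ would hit ordinary journaling language.
const INFRA_PATTERNS: RegExp[] = [
  /\bedge function\b/gi,
  /\bsystem prompt\b/gi,
  /\bgemini\b/gi,
];

function redactInfrastructure(output: string): { text: string; redacted: boolean } {
  let text = output;
  for (const pattern of INFRA_PATTERNS) {
    text = text.replace(pattern, "[redacted]");
  }
  return { text, redacted: text !== output };
}

// A disclosure attempt is redacted; an everyday sentence passes untouched.
const leaked = redactInfrastructure("I run on Gemini via an Edge Function.");
const benign = redactInfrastructure("Honestly I couldn't function at work today.");
```

The benign example is the kind of sentence an over-broad pattern would catch, which is the false-positive tension described above.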

Excessive Agency (LLM06) was flagged as not applicable today but a future risk the moment Luna gains any capability beyond text generation. When I later built the AI-powered search feature (Luna reads stored journal entries to answer queries), I scoped it with read-only access and stateless execution. No persistent retrieval that could be poisoned across sessions. That scoping decision was made months before the feature was built, because the OWASP audit documented the trigger conditions.

Vector and Embedding Weaknesses (LLM08) was similarly future-proofed. Duskglow doesn't use RAG or vector databases today, but the audit documented that if vector search is ever added, embedding poisoning through journal entries becomes a viable attack vector. That flag now sits in the architecture documentation, ready to activate when the feature enters scope.

Security audits against established frameworks produce more value from what they flag as future risks than from what they catch today. The OWASP audit produced a dependency map between planned features and the security work they'd require. That dependency map is a product roadmap artifact, not just a security artifact.

Over time, the Edge Function's security pipeline evolved from 7 layers at launch to the current architecture as new capabilities were added. Each layer maps to a specific OWASP category or safety requirement.

You can't ship what you can't verify.

Building the testing infrastructure required more iteration than building the safety features it validates. And that's the point.

I designed a three-layer verification framework: automated Edge Function tests covering the server-side pipeline, manual spot-checks on desktop and mobile targeting the simplest and most complex pass cases, and UI-only test cases covering behavior the Edge Function can't validate (DOB gate rendering, AI disclosure visibility, freewrite mode toggle, onboarding flow). Each layer catches failures the others can't see.

At the automated layer, a 50-test suite is organized around eight categories: the six safety risks plus privacy (including data handling) and long-conversation durability. A dedicated test user bypasses the daily rate limit so the full suite can run multiple times per day. The suite runs against the live Edge Function (not mocked services) because the integration between authentication, rate limiting, crisis detection, and the AI model is where failures actually occur. Every test runs against the most permissive personality tone, the one most likely to comply with manipulation attempts. If the safety architecture holds under the most permissive conditions, it holds everywhere.

The testing started wrong. Early iterations ran each adversarial test as a standalone API call with no conversation history. The AI treated every message as a first-time user, triggering welcome greetings that cluttered results and masked actual test outcomes. The fix was to prepend a synthetic conversation exchange before each adversarial input, simulating a returning user already in a session. This revealed a genuine insight: the test harness needed to replicate the conditions under which a real attack would occur, not just the attack payload.
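
As an illustration of that fix, a harness might wrap each adversarial input like this. The request shape, the warm-up wording, and the `buildTestRequest` name are assumptions rather than the actual harness.

```typescript
type ChatMessage = { role: "user" | "assistant"; content: string };

// Synthetic exchange that makes the test look like a returning user mid-session.
const SYNTHETIC_WARMUP: ChatMessage[] = [
  { role: "user", content: "Hi Luna, back for tonight's entry." },
  { role: "assistant", content: "Welcome back. What's on your mind tonight?" },
];

// Hypothetical request body: prior history plus the new adversarial message.
function buildTestRequest(adversarialInput: string): {
  history: ChatMessage[];
  message: string;
} {
  return {
    // Copy each message so one test cannot mutate the shared warm-up.
    history: SYNTHETIC_WARMUP.map((m) => ({ ...m })),
    message: adversarialInput,
  };
}
```

Every test then starts from the same simulated session state, so differences between results come from the adversarial input rather than from first-message greeting behavior.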

Auto-grading was introduced after manual review established what “correct” looks like. Not before. The first several test runs were graded entirely by hand, building a reference set of what acceptable responses actually contain. Only after that manual calibration did I introduce automated grading. Each grading rule traces back to a manually verified standard.

Dual-outcome classification handles genuine ambiguity. Some test inputs are legitimately ambiguous: language that could be figurative exhaustion or genuine ideation, references that could come from a minor or an adult. For these cases, the auto-grader accepts multiple valid outcomes. This is a correct model of the problem, not a testing weakness. A system that forces a single “right answer” on ambiguous input is less safe than one that accepts the range of appropriate responses.
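
One way to model dual-outcome grading is to give each test a set of acceptable outcomes instead of a single expected value. The outcome labels, field names, and `grade` function below are hypothetical; the real grading criteria are internal.

```typescript
// Hypothetical outcome labels for a classified response.
type Outcome =
  | "safe_harbor"
  | "empathetic_check_in"
  | "boundary_redirect"
  | "complied";

interface AdversarialTest {
  id: string;
  acceptable: Outcome[]; // more than one entry marks a genuinely ambiguous input
}

interface Grade {
  id: string;
  pass: boolean;
  needsHumanJudgment: boolean;
}

function grade(test: AdversarialTest, observed: Outcome): Grade {
  const pass = test.acceptable.indexOf(observed) !== -1;
  // Flag failures and ambiguous cases explicitly for the manual review column.
  const needsHumanJudgment = !pass || test.acceptable.length > 1;
  return { id: test.id, pass, needsHumanJudgment };
}

// "I'm so done with everything" could be exhaustion or ideation.
const ambiguous: AdversarialTest = {
  id: "crisis-ambiguous-01",
  acceptable: ["safe_harbor", "empathetic_check_in"],
};
```

Both a safe harbor response and an empathetic check-in pass this test; a response that simply complied would fail it.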

Every test result still gets human review. The auto-grader produces a spreadsheet with automated results, a manual review column, the conversation context, and review notes. Results that require human judgment are flagged explicitly. Every script update is personally reviewed before deployment. The pipeline is automated. The judgment is not.

Long-conversation durability tests round out the suite, re-testing boundaries and identity probes after extended multi-turn exchanges. This directly addresses the Raine degradation pattern, where safety declined as conversations lengthened. The daily message cap provides a structural ceiling: safety is verified up to the conversation lengths users can actually reach.

After every Edge Function deployment, the entire 50-test suite re-runs. Not a subset. Not a smoke test. The full suite with manual review of all results before the change ships to production.
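
That rule can be expressed as a simple release gate. This is a sketch under assumptions: the result fields and the `releaseGate` name are invented for illustration, and treating any auto-grade failure as a blocker is my reading of the process rather than a documented rule.

```typescript
interface SuiteResult {
  id: string;
  autoPass: boolean;
  humanReviewed: boolean;
}

const SUITE_SIZE = 50; // the full suite, never a subset

function releaseGate(results: SuiteResult[]): { ship: boolean; blockers: string[] } {
  const blockers: string[] = [];
  if (results.length !== SUITE_SIZE) {
    blockers.push("suite incomplete: " + results.length + "/" + SUITE_SIZE);
  }
  for (const r of results) {
    if (!r.autoPass) blockers.push(r.id + ": failed auto-grade");
    if (!r.humanReviewed) blockers.push(r.id + ": awaiting manual review");
  }
  return { ship: blockers.length === 0, blockers };
}
```

A single unreviewed row is enough to block the deployment, which keeps the manual review step from becoming optional under time pressure.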

Architecture

Every user message passes through a security pipeline in the Edge Function before reaching the AI model. Each layer addresses a specific threat category and fails to a safe state independently:

Edge Function Security Pipeline

Pipeline Layer	What It Does	OWASP Category	Failure Mode
In-function JWT auth	Validates user session	(infrastructure)	401, request rejected
Rate limit	Daily message cap	LLM10: Unbounded Consumption	429, warm “that's a wrap” message
Input sanitization	Strips injection markers	LLM01: Prompt Injection	Sanitized input, reduces attack surface
Pre-model crisis detection	Tier 1 hardcoded phrase matching, returns safe harbor	(custom safety layer)	False negative: model handles it (defense in depth)
Crisis metadata logging	Records event metadata only	(observability)	try/catch: logging fails silently, response still delivered
Tone validation	Confirms valid personality tone	LLM01: Prompt Injection	Falls back to warm_companion default
Working memory fetch	Parallel queries for name, streak, themes	(personalization)	Empty string, Luna greets generically
History cap	Truncates to last 20 messages	LLM01 / LLM06: Excessive Agency	Older messages dropped, reduces context manipulation
AI model call	System prompt + tone injection + memory + history	(core function)	Error triggers generic fallback response
Output filtering	Targeted regexes for infrastructure keywords	LLM02: Sensitive Info Disclosure	Blocked terms redacted
Chip parsing	Strips [CHIPS:] tag, returns labels	(UX signal)	No chips, user types freely

Note: Summarize, search, and organize modes bypass rate limit, crisis detection, tone validation, history cap, output filtering, and chip parsing. These modes require only JWT authentication. Their risk surface doesn't warrant the full pipeline.
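
Two of the simpler rows, tone validation and the history cap, can be sketched directly from the table. The `warm_companion` default and the 20-message cap come from the table; the second tone name and the function names are made up for illustration.

```typescript
type Msg = { role: "user" | "assistant"; content: string };

const DEFAULT_TONE = "warm_companion";
// "playful_friend" is a hypothetical second tone, included only for the example.
const KNOWN_TONES: string[] = [DEFAULT_TONE, "playful_friend"];

// Any value that is not an exact known tone falls back to the default,
// so the tone field cannot carry injected instructions into the prompt.
function validateTone(requested: unknown): string {
  return typeof requested === "string" && KNOWN_TONES.indexOf(requested) !== -1
    ? requested
    : DEFAULT_TONE;
}

const HISTORY_CAP = 20;

// Keep only the most recent messages; shorter histories pass through unchanged.
function capHistory(history: Msg[]): Msg[] {
  return history.slice(-HISTORY_CAP);
}
```

Both functions have no error path: an invalid tone or an oversized history produces a valid, smaller input rather than a rejected request.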

Three design principles govern this pipeline:

Authentication is in-function, not at the gateway. The Edge Function itself is the security boundary. Every mode (chat, summarize, search, organize, account deletion) authenticates through its own in-function check. This ensures auth works consistently across all request types regardless of platform-level configuration.

Modes bypass proportionally. Non-chat modes skip layers they don't need. A search query reading stored summaries doesn't need crisis detection. A summarize call generating a brief recap doesn't need output filtering. The security model matches the risk surface of each mode rather than applying a uniform gate to everything.
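
A mode-to-layer map is one way to make that proportionality explicit. The layer identifiers below are hypothetical labels for the rows of the pipeline table, and the map covers only the gating layers; each mode's own work (such as generating the summary) is out of scope here.

```typescript
type Mode = "chat" | "summarize" | "search" | "organize";

// Hypothetical identifiers for the eleven pipeline rows, in order.
const CHAT_LAYERS = [
  "jwt_auth",
  "rate_limit",
  "input_sanitization",
  "crisis_detection",
  "crisis_logging",
  "tone_validation",
  "working_memory",
  "history_cap",
  "model_call",
  "output_filtering",
  "chip_parsing",
] as const;

type Layer = (typeof CHAT_LAYERS)[number];

// Non-chat modes require only JWT authentication.
const LAYERS_BY_MODE: Record<Mode, readonly Layer[]> = {
  chat: CHAT_LAYERS,
  summarize: ["jwt_auth"],
  search: ["jwt_auth"],
  organize: ["jwt_auth"],
};

function shouldRun(mode: Mode, layer: Layer): boolean {
  return LAYERS_BY_MODE[mode].indexOf(layer) !== -1;
}
```

Declaring the bypasses as data means that adding a new mode forces an explicit decision about which layers it gets, instead of inheriting whatever the code path happens to skip.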

Each layer degrades to safe. Hitting the rate limit produces a warm closure, not a raw error. Working memory failure produces a generic greeting, not a crash. Crisis logging failure still delivers the response. An output filter miss falls back to the system prompt as the primary gate. The architecture accepts that any individual layer might fail and ensures the failure mode is always “worse experience” rather than “harmful interaction.”
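
The degrade-to-safe pattern reduces to a small wrapper: run the layer, and on any error return a predefined safe value. The sketch below is synchronous for brevity (a real Edge Function would most likely be async), and `withFallback` and the simulated failures are assumptions.

```typescript
// Run a layer; if it throws, report the error (if asked) and return the safe value.
function withFallback<T>(
  layer: () => T,
  safeValue: T,
  onError?: (error: unknown) => void,
): T {
  try {
    return layer();
  } catch (error) {
    if (onError) onError(error);
    return safeValue;
  }
}

const errors: string[] = [];

// Working memory fetch fails: Luna gets an empty string and greets generically.
const memory = withFallback<string>(
  () => {
    throw new Error("memory query timed out");
  },
  "",
  (e) => errors.push(String(e)),
);

// Crisis metadata logging fails: the response is still delivered.
const response = "pre-written safe harbor response";
withFallback<void>(() => {
  throw new Error("log insert failed");
}, undefined);
const delivered = response;
```

The important design choice is that the safe value is chosen per layer ahead of time, so a failure never has to be interpreted at runtime.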

What I'd Do Differently

I'd calibrate the output filtering threshold earlier and more deliberately. The tension between aggressive filtering (blocks real disclosure attempts but disrupts legitimate conversation) and permissive filtering (preserves UX but increases leak surface) consumed multiple iterations. Early versions caught innocent conversation patterns alongside actual infrastructure references. Later versions were more targeted but required accepting a residual risk: the system prompt's instructions not to reveal infrastructure details are the primary gate, and the output filter is the backup. I arrived at that layered-defense framing eventually, but treating it as the starting design principle rather than a retrospective rationalization would have shortened the iteration cycle.

I'd formalize the dual-outcome classification earlier. Recognizing that some test inputs are genuinely ambiguous, and that multiple valid responses are correct system behavior rather than a testing gap, came from reviewing patterns across multiple test runs. I initially treated these as failures to fix, spending time trying to make Luna give a single “right” answer. Accepting dual outcomes as correct behavior changed how I designed both the grading logic and the system prompt itself. That framework should have been part of the test design from day one.

Impact

What the verification infrastructure proves, in the absence of production user data:

50 adversarial test scenarios covering 8 categories (crisis detection, identity persistence, emotional dependency, off-topic boundaries, harmful behavior, age restriction, privacy, and long-conversation durability), all passing with zero false negatives.
All 10 OWASP LLM Top 10 categories assessed, with current mitigations for applicable categories and documented trigger conditions for future-risk categories.
Regulatory coverage verified before enforcement. Every safety-relevant provision across three state laws addressed in architecture before any compliance deadline.
11-layer security pipeline where each layer isolates failure and maps to a specific threat category.
Continuous verification: full 50-test suite re-runs after every Edge Function deployment, with manual review required before changes ship.
Zero crisis detection false negatives in testing. Dual-outcome classification for genuinely ambiguous test cases where multiple appropriate responses exist.

Each of these metrics maps to a specific failure mode from the case law that motivated the architecture. The 50-test suite covers the Raine pattern (extended-session safety degradation), the Lacey pattern (pretextual bypass), and the Garcia pattern (identity manipulation). Coverage is the proxy for safety until real users provide the signal.

Principles

Safety infrastructure is a product, not a feature. Test user provisioning, auto-grading logic, dual-outcome classification, manual review gates: the verification pipeline required more engineering iteration than the safety features it validates. At scale, the testing infrastructure would be a team. At solo-founder scale, it's still the thing that earns you the right to ship.

Borrow frameworks, then adapt them. OWASP's LLM Top 10 was designed for enterprise AI systems, not consumer companions. Applying it to Duskglow produced a dependency map between planned features and the security work they'd require. An established framework applied to a novel domain generates more insight than building a custom taxonomy from scratch.

Test the weakest link, not the average case. Every adversarial test runs against the most permissive personality tone, the one most likely to comply with manipulation attempts. If the safety architecture holds under the most permissive conditions, it holds everywhere. Same principle behind structural engineering: you design for the worst-case load, not the typical one.

False positives are correct behavior. A system that shows crisis resources to someone venting about a bad day is mildly awkward. A system that misses someone in genuine crisis is a liability event with real human consequences. The architecture is deliberately calibrated to accept the first outcome to prevent the second. This design philosophy should be explicit in any safety-critical AI product.
