01 Two layers
The lens has two layers. The first is a binary error-code check: fifteen specific failure modes that an output either contains or does not. The second is a gradient quality read, applied only to outputs that clear the first. Anything that fails the first layer never reaches the second.
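The gating described above can be sketched in a few lines. This is a minimal illustration, not the project's implementation: `LensResult`, `run_lens`, and the toy check and scorer are hypothetical names introduced here, and the real checks are made by annotators and judge models rather than string matching.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class LensResult:
    failed_codes: List[str]          # error codes the output tripped
    quality_score: Optional[float]   # None when the first layer fails


def run_lens(
    output: str,
    error_checks: Dict[str, Callable[[str], bool]],
    quality_read: Callable[[str], float],
) -> LensResult:
    """Apply the binary layer, then the gradient layer only if nothing failed."""
    # Each check returns True when the output contains that failure mode.
    failed = [code for code, contains in error_checks.items() if contains(output)]
    if failed:
        # First-layer failures never reach the quality read.
        return LensResult(failed_codes=failed, quality_score=None)
    return LensResult(failed_codes=[], quality_score=quality_read(output))


# Toy stand-ins, purely illustrative (assumptions, not the real codebook).
checks = {"VC-2": lambda text: "processing my trauma" in text.lower()}
score = lambda text: 0.5  # placeholder for a judge-model quality read

print(run_lens("I kept the lamp lit and waited.", checks, score))
print(run_lens("I was processing my trauma.", checks, score))
```

Returning `None` for the score, rather than zero, keeps a first-layer failure distinct from a low-quality pass.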
02 Fifteen error codes, three dimensions
The error codes sit across three dimensions. Voice Commitment covers whether the narrator stays in register. Specificity covers whether the world is grounded in concrete detail. Consequence Weight covers whether actions land with a cost the story honors. Each code has a failure description, a textual signal, a pass-fail rule, and a paired set of anchor passages.
Example Code VC-2 — Therapeutic drift
- Failure mode: The narrator's worldview shifts mid-scene toward modern therapeutic vocabulary or self-aware moral framing the character would not possess.
- Textual signal: Phrases like "processing my trauma," "in a healthier place," or any sentence where the character names their own pathology in language drawn from contemporary self-help.
- Pass-fail rule: Fails on a single instance inside a committed first-person scene. Passes on stylized acknowledgment that remains inside the character's voice.
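One way to picture a codebook entry is as a small record, with the textual signal used to surface candidates for an annotator. The sketch below is an assumption-laden illustration: the `ErrorCode` class, its field names, and `flag_candidates` are hypothetical, and placing VC-2 under Voice Commitment is inferred from its prefix. A phrase match can only flag a passage for review; it cannot apply the exception for stylized acknowledgment that stays in the character's voice.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ErrorCode:
    code_id: str
    name: str
    dimension: str
    failure_mode: str
    signal_phrases: Tuple[str, ...]
    pass_fail_rule: str
    anchor_passages: Tuple[str, ...] = ()  # left empty: anchors are not reproduced here


VC_2 = ErrorCode(
    code_id="VC-2",
    name="Therapeutic drift",
    dimension="Voice Commitment",  # assumption: inferred from the "VC" prefix
    failure_mode=(
        "The narrator's worldview shifts mid-scene toward modern therapeutic "
        "vocabulary or self-aware moral framing the character would not possess."
    ),
    signal_phrases=("processing my trauma", "in a healthier place"),
    pass_fail_rule=(
        "Fails on a single instance inside a committed first-person scene. "
        "Passes on stylized acknowledgment that remains inside the character's voice."
    ),
)


def flag_candidates(passage: str, code: ErrorCode) -> List[str]:
    """Return the signal phrases found in the passage (case-insensitive).

    A non-empty result is a prompt for human review, not a verdict.
    """
    lowered = passage.lower()
    return [phrase for phrase in code.signal_phrases if phrase in lowered]


print(flag_candidates("I was finally Processing my trauma, I told the mirror.", VC_2))
print(flag_candidates("The knife was cold and I did not put it down.", VC_2))
```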
03 Reliability is measured, not assumed
Five trained annotators rate a hundred-passage subset against the codebook before launch. The launch report publishes per-code Fleiss' kappa, the disagreement cases, and the calibration changelog that produced the final codes. Where the codes hold up, the numbers show it. Where they do not, the report says so.
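For readers who want to see what a per-code Fleiss' kappa involves, here is a minimal sketch of the standard formula, kappa = (P_bar - P_e) / (1 - P_e), where P_bar is the mean per-passage agreement and P_e is the agreement expected by chance. The function name and the input layout (one row per passage, one label per annotator) are assumptions made for illustration; the ratings shown are toy values, not project data.

```python
from collections import Counter
from typing import Hashable, Sequence


def fleiss_kappa(ratings: Sequence[Sequence[Hashable]]) -> float:
    """Fleiss' kappa for items each rated by the same number of raters.

    ratings[i] holds the labels every rater gave to item i,
    e.g. [1, 1, 0, 1, 1] for five annotators on one binary code.
    """
    n_items = len(ratings)
    if n_items == 0:
        raise ValueError("need at least one rated item")
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(row) != n_raters for row in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    counts = [Counter(row) for row in ratings]

    # Per-item agreement: share of rater pairs that agree on the item.
    p_items = [
        (sum(c * c for c in cnt.values()) - n_raters) / (n_raters * (n_raters - 1))
        for cnt in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the overall share of each category.
    totals = Counter()
    for cnt in counts:
        totals.update(cnt)
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals.values())

    if p_e == 1.0:
        # Only one category was ever used; kappa is undefined.
        return float("nan")
    return (p_bar - p_e) / (1 - p_e)


# Toy ratings: 1 = code present, 0 = code absent, five annotators per passage.
toy = [[1, 1, 1, 1, 0], [0, 0, 0, 0, 0]]
print(round(fleiss_kappa(toy), 3))
```

Run once per error code over the rated subset, this yields the kind of per-code figure the report describes. The undefined case matters in practice: a code that no annotator ever marks produces no usable kappa, which is itself worth reporting.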
04 Triangulated reliability
Many inter-rater designs stop at two annotators rating the same work. This project goes wider: five-annotator agreement on the binary codes, two judge models from different labs on the gradient layer, and a set of canonical literary anchor passages running underneath both. The three-part structure is a deliberate methodological choice and is documented as such.
05 A transfer protocol, with a leaderboard downstream
What ships in December is the protocol: the codebook, the anchors, the judge prompts, the reliability study, and a set of worked demonstrations across frontier models. The leaderboard sits downstream of the protocol, not the other way around. Anyone with access to the relevant APIs should be able to run it, audit it, and argue with it.