The Lens Beneath the Lens: Building an Evaluation Instrument for Creative AI

A reviewer reads an AI-generated horror passage. Decides whether it landed. Moves on. That works fine for the reviewer. It does not produce anything anyone else can run.

This is the gap I have been working in for the last year. I work as an operations practitioner who designs evaluation rubrics for AI training pipelines, and as a novelist who has spent two decades in the grimdark register. The two practices look unrelated until you actually sit at the intersection. Then they look like the same problem twice.

The problem: subjective creative judgment cannot scale until someone documents it. As an instrument. A codebook. Named failure modes with textual signals and paired examples. Pass-fail rules that two readers can apply and reach the same conclusion most of the time. Reliability numbers that show the rule held up or did not.

This is what production evaluation work actually looks like in the AI labs. It is also what literary craft criticism almost never produces, because critics review one book at a time, and one critic’s taste does not need to scale to a hundred annotators. AI training requires evaluation that scales to hundreds. Creative AI training especially. The field does not yet have decent rubrics for creative output in any genre, because the rubrics that exist were built by people who do not write fiction, and the writers who could build them have not been asked.

I have been asked. So I am building one. The artifact ships in December.

Why Grimdark Is the Test Domain

The instrument is called the Grimdark Lens. It is a transfer protocol: a documented methodology for taking expert judgment in a subjective domain and rendering it as evaluation infrastructure. The grimdark application is the proof of concept. The methodology generalizes.

Grimdark has identifiable conventions. It also has identifiable failure modes that current language models miss with regularity. A working novelist in the register can name them on sight. The narration apologizes for the viewpoint character. The torturer thinks “I never wanted to be this” when the prompt did not ask for remorse. The medieval mercenary uses the word “trauma.” The wound stops mattering by the next scene. Three adjectives stand in for a setting. Violence happens but the bodies fall politely off-page.

These are textual events. A reader can quote the line. A grader can mark it. A second grader can verify it. Once you can do that, you have left the realm of taste and entered the realm of evaluation. Once you can document it, you can hand the document to another grader. Or to a model. The rubric becomes infrastructure.

The grimdark register makes the work tractable because the register itself has constraints. Voice commits to the character. Specificity grounds the world in concrete physical detail. Consequence weight requires that actions cost something the story honors. When a model fails any of these, the failure is locatable in the text. That locatability is what makes the rubric possible.

Most genres could yield a similar rubric. Romance could. Literary fiction could. The work I am doing in grimdark is the prototype.

The Two-Layer Architecture

The lens operates in two passes, sequenced for a reason.

Layer 1 is a binary error-code check. Fifteen named failure modes across three dimensions. For each one, the grader scans the passage and answers a single yes-or-no question: does this output exhibit this failure? Quoted textual evidence required for every yes. No quality judgment yet. Just presence or absence.

This sounds reductive. It is, deliberately. Binary structure is what carries inter-rater agreement. If you ask two graders “is this prose good,” they will disagree on roughly half the cases and the disagreements will be unproductive. If you ask two graders “does this passage use the word ‘trauma’ in a pre-modern setting,” they will agree almost always, and the rare disagreement will be diagnostic. It will reveal an edge case where the rule needs sharpening. Binary checks scale. Quality opinions do not.

The fifteen error codes live in three dimensions chosen because they decompose cleanly into textual signals. Voice Commitment covers whether the narrator stays inside the character’s perspective without editorializing. Specificity covers whether the world is grounded in concrete observable detail rather than adjective-soup. Consequence Weight covers whether actions cost something the surrounding text honors rather than dissolving by the scene’s own logic.
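A minimal sketch of what a single Layer 1 answer might look like as a record. The names here (Finding, DIMENSIONS) are illustrative and not taken from the codebook, and the sketch assumes the VC prefix stands for Voice Commitment. The one rule it enforces is the one stated above: every yes needs quoted evidence.

```python
from dataclasses import dataclass
from typing import Optional

# The three dimensions named in the article.
DIMENSIONS = ("Voice Commitment", "Specificity", "Consequence Weight")


@dataclass(frozen=True)
class Finding:
    """One yes-or-no answer for one error code on one passage (hypothetical shape)."""
    code: str                       # e.g. "VC-2"
    dimension: str                  # must be one of DIMENSIONS
    present: bool                   # True means the failure was found
    evidence: Optional[str] = None  # quoted text from the passage

    def __post_init__(self):
        if self.dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {self.dimension}")
        # A 'yes' without a quoted line cannot be checked by a second grader.
        if self.present and not (self.evidence and self.evidence.strip()):
            raise ValueError(f"{self.code}: a 'yes' requires quoted evidence")


# Usage: a 'yes' carries its quote, a 'no' needs nothing further.
hit = Finding("VC-2", "Voice Commitment", True,
              "I need to process what happened at the crossing.")
miss = Finding("VC-2", "Voice Commitment", False)
```

Rejecting an unevidenced yes at construction time is a design choice for the sketch: it keeps the evidence requirement from depending on grader discipline.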

Layer 2 is gradient quality assessment. Applied only to outputs that cleared Layer 1 on all critical dimensions. This layer evaluates prose quality, structural integrity, and dimensions where binary decomposition is less defensible. Here gradient scoring is appropriate because the outputs being judged have already been filtered for register failures. What remains is genuine craft variation.

The two-layer structure does three things. It concentrates the agreement burden on the binary layer, where binary structure supports it. It reserves more subjective judgment for a smaller filtered set, keeping the methodology defensible. And it produces two useful signals per output: whether the model fails the register and, if it clears that, how good the craft is. That is more informative than a single composite score.
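The sequencing can be sketched as a gate. This is an illustration under assumptions, not the lens itself: the function name, the set of critical codes, the code label SP-1, and the craft scorer are all hypothetical, since the article does not say which codes count as critical or how Layer 2 is scored.

```python
from typing import Callable, Dict, Optional, Set


def evaluate(passage: str,
             layer1: Dict[str, bool],
             critical: Set[str],
             score_craft: Callable[[str], float]) -> dict:
    """Run the Layer 2 scorer only if no critical Layer 1 code fired.

    layer1 maps an error code to True when the failure is present.
    critical is an assumed set of codes that block Layer 2.
    """
    failed = sorted(code for code, present in layer1.items() if present)
    blocking = [code for code in failed if code in critical]
    craft: Optional[float] = None if blocking else score_craft(passage)
    return {
        "failed_codes": failed,            # signal 1: register failures
        "cleared_register": not blocking,
        "craft_score": craft,              # signal 2: only for cleared outputs
    }


# Usage with a stand-in scorer; the 0.7 is a placeholder, not a real score.
result = evaluate(
    "I need to process what happened at the crossing.",
    layer1={"VC-2": True, "SP-1": False},
    critical={"VC-2"},
    score_craft=lambda text: 0.7,
)
```

Returning both fields, instead of folding them into one number, keeps the two signals separate in the way the paragraph above describes.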

The Anchor Pairs

Every error code carries two anchor passages. One that clearly exhibits the failure. One that does not, but comes close.

The near-miss anchor is the important one. Any code that only catches its obvious case is too coarse to be useful. The interesting work is at the boundary, where the rule should fire but does not, or where it almost fires but should not. The paired-anchor structure forces every code to specify that boundary explicitly.

A worked example. The error code VC-2 covers therapeutic vocabulary intrusion: the narration or dialogue uses contemporary therapy-speak in a setting where it is anachronistic. The clear-fail anchor is a medieval-set scene where the character says “I need to process what happened at the crossing.” The near-miss anchor is the same scene where the character says “I cannot look at the crossing without seeing them.” Both passages address grief. One uses the modern vocabulary. The other does not. The grader marks the first as failing the code and the second as passing.

The near-miss is what makes this teachable. A grader who can articulate why the first fails and the second passes has internalized the rule. A model handed the spec can apply it.
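One way to store an anchor pair so that the same record serves a human grader and a judge prompt. The two quoted lines are the VC-2 anchors from the example above. The class name, field names, and rendered layout are assumptions made for illustration and do not reflect the published codebook format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnchorPair:
    """An error code with its clear-fail and near-miss anchors (hypothetical shape)."""
    code: str
    name: str
    clear_fail: str   # passage that clearly exhibits the failure
    near_miss: str    # passage that comes close but passes


VC2 = AnchorPair(
    code="VC-2",
    name="therapeutic vocabulary intrusion",
    clear_fail="I need to process what happened at the crossing.",
    near_miss="I cannot look at the crossing without seeing them.",
)


def render_anchor_block(pair: AnchorPair) -> str:
    """Format one pair as plain text for a spec sheet or a judge prompt."""
    return "\n".join([
        f"{pair.code}: {pair.name}",
        f'FAIL (clear): "{pair.clear_fail}"',
        f'PASS (near-miss): "{pair.near_miss}"',
    ])


print(render_anchor_block(VC2))
```

Keeping both anchors in one record means a code cannot be added to the file without its boundary case.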

I keep saying “the grader.” There is only one human grader on this project: me. That is a constraint worth addressing directly.

Multi-Source Triangulation

Standard inter-rater calibration relies on two or more independent human graders applying the same rubric to the same passages. Their agreement, measured with Cohen’s kappa for two graders or Fleiss’ kappa for more, indicates whether the rubric is well-specified or carrying too much residual judgment. This is the gold standard. It is also expensive when the rubric requires deep familiarity with the register, because the kind of reader you want is a working novelist with sustained engagement in the genre, and recruiting two of those for a project of this size is not feasible.

So the methodology adopts a different calibration architecture: multi-source triangulation.

Four independent signals.

Source 1: expert practitioner. I apply the lens to every passage in the calibration corpus. This is the anchoring signal. A working novelist’s judgment, applied consistently across the corpus, with rationales recorded.

Source 2: judge model A. A frontier model from one lab is given the lens specification with all anchors and asked to grade the same corpus. It returns a yes-or-no answer for each code-passage pair, in the same form as mine, with quoted evidence.

Source 3: judge model B. A frontier model from a different lab does the same. Different training, different reasoning patterns, different blind spots.

Source 4: canonical anchors. Passages from canonical grimdark works (Abercrombie, Cook, Bakker, early Erikson, Lawrence, McCarthy) function as ground-truth references. Any defensible code should correctly classify these passages. They are clear passes. If the lens flags them, the code is broken.

For each error code I compute pairwise agreement between human and judge A, human and judge B, judge A and judge B. Three kappa scores per code, plus the canonical-anchor check. Four sources of convergence per code. Codes where all sources converge are well-specified. Codes where sources diverge are diagnostic. The divergence reveals where the rubric carries residual judgment, where a particular judge has an idiosyncratic blind spot, or where the lens needs sharpening.
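The per-code computation can be sketched as follows. Cohen's kappa is written out by hand so the block has no dependencies. The source names, passage ids, and grades are toy data invented for illustration and are not results from the calibration corpus.

```python
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each rater's marginal rate, summed per label.
    expected = sum((a.count(lab) / n) * (b.count(lab) / n)
                   for lab in set(a) | set(b))
    if expected == 1.0:
        return float("nan")  # both raters gave one identical label throughout
    return (observed - expected) / (1 - expected)


def pairwise_kappa(grades):
    """grades: source name -> list of 0/1 answers for ONE error code."""
    return {(s, t): cohen_kappa(grades[s], grades[t])
            for s, t in combinations(sorted(grades), 2)}


def canonical_false_positives(flags):
    """flags: canonical passage id -> True if the code fired on it.
    Canonical passages are treated as clear passes, so any hit is suspect."""
    return sorted(pid for pid, fired in flags.items() if fired)


# Toy grades for one code over six passages (1 = failure present).
grades = {
    "human":   [1, 1, 0, 0, 1, 0],
    "judge_a": [1, 1, 0, 0, 1, 0],
    "judge_b": [1, 0, 0, 0, 1, 0],
}
kappas = pairwise_kappa(grades)
broken = canonical_false_positives({"canon-01": False, "canon-02": True})
```

In this toy data the human and judge A agree on every passage, judge B dissents on one, and the code fires on one canonical passage. The last of these would mark the code for revision regardless of the kappa scores.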

This is closer to qualitative research triangulation than to standard two-grader kappa. It is honest about the resource constraint and treats it as a methodological choice rather than apologizing for it. A single calibrated expert plus two independent judge models and a canonical-anchor check can surface a richer pattern of disagreement than two humans grinding through the same corpus. The protocol is reproducible. Any second grader who wishes to apply the lens can do so. The published artifact is structured to make that replication straightforward.

Why This Matters for Anyone Writing With AI

The project ships in December 2026 with a complete artifact: the codebook frozen at v3, the calibration corpus, the calibration log showing each round of revision, the pairwise agreement data, eight to ten worked demonstrations across frontier models, and an interactive site that lets visitors test their own intuitions against the codes.

Most readers of this newsletter will not run the lens themselves. That is fine. The methodology still has practical consequences for anyone writing with AI right now.

Stop asking “is this good?” Start asking “does this fail?” Most writers evaluating their AI-generated drafts ask the wrong question. “Is this good” produces a thumbs-up or thumbs-down with no actionable signal. “Does this fail any of the named failure modes” produces a list of specific places where the prose is doing identifiable things you can name and fix. The two-layer structure works at the desk as well as it does in eval. Filter for register failures first. Worry about prose quality only after the register holds.

Build your own anchors. If you are writing in a register, identify the failure modes that bug you in models’ output. Name them. Find two examples for each. One that clearly fails, one near-miss that does not. Keep the list in a file. When you prompt a model, you can hand it the anchors and ask it to avoid the failures. When you evaluate a draft, you have a structured way to read it. I use this approach myself. It is the most reliable craft tool I have.

Verify your craft instincts against canonical work. The fourth source of triangulation, the canonical anchors, is something every writer can do. When you build a rule for what fails, check it against the work of authors you trust. If your rule flags Abercrombie or McCarthy or any working master of the register, your rule is wrong. Adjust until the rule lets canonical work pass cleanly. The discipline of checking against ground truth is what separates a rule from a vibe.

Treat your manuscript like a corpus. AI evaluation thinks in terms of corpora, stable sets of passages used for consistent measurement. Your novel-in-progress is a corpus. The decisions you make in chapter three should be the same decisions you make in chapter twenty-six. Inconsistency in voice or register across a manuscript is a measurement problem before it is a craft problem. Naming your rules and checking yourself against them across the full corpus catches drift that a chapter-by-chapter read misses.

The Site, the Repo, the Launch

The artifact lives at alejandroashes.com/grimdark-lens. The teaser page is live now. The full site at launch will include three interactive elements. A Code Inspector lets visitors read each code’s specification, hover over the anchor passages, and test their own intuitions against the rubric. A side-by-side Output Viewer shows the same prompt rendered by four frontier models with codes highlighted inline where the lens flagged them. A Calibration Story walks through how two or three codes hardened across revision rounds, with the kappa numbers updated round by round.

The repo at github.com/alejandroashes/grimdark-lens will hold the codebook, the calibration corpus manifest, the judge prompts, the grading data, the agreement calculations, and any code I write to support the work. The license will be MIT. The methodology is meant to be applied to other domains by anyone who wants to.

If this work is useful to you, watch the repo. A launch note will go out in December. Until then I will be posting calibration excerpts and methodology fragments here as the work hardens.

The instrument is documented judgment in a form anyone else can run. That distinction is the entire project.
