Launching December 2026

The Grimdark Lens

An evaluation instrument for grimdark fiction, and a transfer protocol for tacit creative judgment.

Most evaluations of creative writing collapse into taste. A reviewer reads an output, decides whether it landed, and moves on. That works fine for a single reader. It does not produce infrastructure anyone else can run.

The Grimdark Lens is a project about closing that gap in a domain where the gap is wide. Grimdark fiction has identifiable conventions and identifiable failure modes, and current language models miss the register in ways a working novelist can name on sight. This project takes that practitioner-level judgment and turns it into something a machine can apply, with the rubric, the anchors, the reliability study, and the reasoning all open to inspection.

How the lens works

Two layers

The lens has two passes. The first is a binary error-code check: fifteen specific failure modes that an output either contains or does not. The second is a gradient quality read, applied only to work that clears the first pass. Outputs that fail the first layer never reach the second.
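The gating between the two passes can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's implementation: `run_lens`, `LensResult`, the toy check, and the placeholder gradient score are all hypothetical names invented for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Hypothetical signatures: a binary check returns True when the failure
# mode is present; a gradient read returns a quality score.
BinaryCheck = Callable[[str], bool]
GradientRead = Callable[[str], float]


@dataclass
class LensResult:
    failed_codes: List[str] = field(default_factory=list)
    quality: Optional[float] = None  # stays None when the first layer fails


def run_lens(passage: str,
             checks: Dict[str, BinaryCheck],
             gradient: GradientRead) -> LensResult:
    """Apply the binary layer first; run the gradient read only on a clean pass."""
    failed = [code for code, check in checks.items() if check(passage)]
    if failed:
        # First-layer failures never reach the second layer.
        return LensResult(failed_codes=failed)
    return LensResult(quality=gradient(passage))


# Toy stand-ins, invented for illustration only.
toy_checks = {"VC-2": lambda text: "processing my trauma" in text.lower()}
toy_gradient = lambda text: 0.5  # placeholder score, not a real judge

blocked = run_lens("I was processing my trauma by the pyre.", toy_checks, toy_gradient)
cleared = run_lens("The pyre took him. I kept the boots.", toy_checks, toy_gradient)
```

The point of the shape is that a gradient score is only ever produced for work that cleared every binary code, so a fluent but off-register output cannot earn a quality score at all.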

Fifteen error codes, three dimensions

The error codes sit across three dimensions. Voice Commitment covers whether the narrator stays in register. Specificity covers whether the world is grounded in concrete detail. Consequence Weight covers whether actions land with a cost the story honors. Each code has a failure description, a textual signal, a pass-fail rule, and a paired set of anchor passages.

Code VC-2 — Therapeutic drift

Failure mode: The narrator's worldview shifts mid-scene toward modern therapeutic vocabulary or self-aware moral framing the character would not possess.
Textual signal: Phrases like "processing my trauma," "in a healthier place," or any sentence where the character names their own pathology in language drawn from contemporary self-help.
Pass-fail rule: Fails on a single instance inside a committed first-person scene. Passes on stylized acknowledgment that remains inside the character's voice.
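One way to hold a codebook entry as data, using VC-2 as the example. This is a sketch under assumptions: the `ErrorCode` class, its field names, and the reading of "paired" anchors as (failing, passing) pairs are hypothetical, and the anchors are left empty because none are quoted here. The phrase screen only surfaces candidate sentences for review. It does not decide pass or fail, since the rule's exception for stylized acknowledgment inside the character's voice needs a reader's judgment.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ErrorCode:
    code_id: str
    name: str
    failure_mode: str
    textual_signal: str
    pass_fail_rule: str
    # Assumed shape: (failing passage, passing passage) pairs. Left empty here.
    anchor_pairs: Tuple[Tuple[str, str], ...] = ()


VC_2 = ErrorCode(
    code_id="VC-2",
    name="Therapeutic drift",
    failure_mode=("The narrator's worldview shifts mid-scene toward modern "
                  "therapeutic vocabulary or self-aware moral framing the "
                  "character would not possess."),
    textual_signal=("The character names their own pathology in language "
                    "drawn from contemporary self-help."),
    pass_fail_rule=("Fails on a single instance inside a committed "
                    "first-person scene. Passes on stylized acknowledgment "
                    "that remains inside the character's voice."),
)

# The two example phrases given in the textual signal above.
SIGNAL_PHRASES = ("processing my trauma", "in a healthier place")


def candidate_hits(passage: str,
                   phrases: Tuple[str, ...] = SIGNAL_PHRASES) -> List[str]:
    """Return signal phrases found in the passage, as candidates for review."""
    lowered = passage.lower()
    return [phrase for phrase in phrases if phrase in lowered]
```

A hit from `candidate_hits` is a prompt to apply the pass-fail rule, not a verdict. An empty result does not clear the passage either, because the signal covers any self-help phrasing, not only the two examples.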

Fourteen more codes at launch.

Reliability is measured, not assumed

Five trained annotators rate a hundred-passage subset against the codebook before launch. The launch report publishes per-code Fleiss' kappa, the disagreement cases, and the calibration changelog that produced the final codes. Where the codes hold up, the numbers show it. Where they do not, the report says so.
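For readers who want to see what a per-code Fleiss' kappa involves, here is a minimal from-scratch sketch of the standard formula for one binary code. The function name, the input layout (one row of annotator votes per passage, with 1 meaning the code was flagged), and the toy votes are assumptions made for illustration. They are not the project's data or tooling.

```python
from typing import Sequence


def fleiss_kappa(ratings: Sequence[Sequence[int]],
                 categories: Sequence[int] = (0, 1)) -> float:
    """Fleiss' kappa for a fixed number of raters per passage."""
    if not ratings:
        raise ValueError("need at least one rated passage")
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(row) != n_raters for row in ratings):
        raise ValueError("every passage needs the same number (>= 2) of ratings")

    # counts[i][j]: how many raters placed passage i in category j
    counts = [[list(row).count(c) for c in categories] for row in ratings]
    n_items = len(counts)

    # Observed agreement per passage, then averaged
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_bar = sum(p_i) / n_items

    # Chance agreement from the overall category proportions
    p_j = [sum(row[j] for row in counts) / (n_items * n_raters)
           for j in range(len(categories))]
    p_e = sum(p * p for p in p_j)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when all ratings fall in one category")

    return (p_bar - p_e) / (1.0 - p_e)


# Toy votes from five annotators on four passages, invented for illustration.
unanimous = [[1, 1, 1, 1, 1], [0, 0, 0, 0, 0], [1, 1, 1, 1, 1], [0, 0, 0, 0, 0]]
split = [[1, 1, 1, 0, 0], [0, 0, 0, 1, 1]]

kappa_unanimous = fleiss_kappa(unanimous)
kappa_split = fleiss_kappa(split)
```

Running this once per code over the hundred-passage subset would give the kind of per-code table described above. A code that every annotator applies identically scores 1, while three-to-two splits on every passage push the value below zero.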

Triangulated reliability

Standard inter-rater designs assume two annotators on the same work. This project goes wider: five-annotator agreement on the binary codes, two judge models from different labs on the gradient layer, and a set of canonical literary anchor passages running underneath both. The structure is documented as a deliberate methodological choice.

A transfer protocol, with a leaderboard downstream

What ships in December is the protocol: the codebook, the anchors, the judge prompts, the reliability study, and a set of worked demonstrations across frontier models. The leaderboard sits downstream of the protocol, not the other way around. Anyone with access to the relevant APIs should be able to run it, audit it, and argue with it.

At launch

The shipped artifact in December includes:

Code Inspector

Read each error code, toggle between the anchor passages that define it, and run your own passage against the rubric to see how the lens reads it.

Side-by-side output viewer

The same prompt rendered by four frontier models, with error codes highlighted inline where the lens flagged them.

Calibration story

A scrollable walk through how the codebook changed across calibration rounds, with the passages that forced each revision.

Reliability report

A full table of per-code Fleiss' kappa values from the five-annotator study, with a written walkthrough of the sharpest disagreements and what they revealed about the codes.

About the practitioner

Alejandro Ashes is the author of seven novels in and around the grimdark register, available at alejandroashes.com, where he also writes the Eldritch Grimoire, a series on AI-powered dark fiction craft. He has twenty years of experience designing evaluation frameworks and currently does rubric work for frontier AI labs. The Grimdark Lens sits at the intersection of those two practices.

Follow the work in progress on GitHub

A launch note will go out in December. Watch the repo for now.