The Grading Rubric Is Deliberately Brutal
The credential isn't the score — it's the standard the score is produced under. The full 12-rule, self-adversarial grading protocol, where every rule cuts…
The grading rubric is deliberately brutal — the full self-adversarial protocol
The credential is not the score; it is the standard the score is produced under. All twelve rules below cut AGAINST the operator, on purpose — and the rubric is symmetric: apply it to anyone.
- Sealed before the event. The exact text is published to public infrastructure before its window, canonicalised objectId|dateIssued|title|claim, SHA-256-hashed, and Bitcoin-anchored via OpenTimestamps — anteriority is cryptographic, not asserted; it cannot be edited, backdated, or deleted.
- Named, specific, unhedged. The actor, place, window, and mechanism are named in advance — not “tensions may rise.” Barnum vagueness earns nothing; specificity is what is graded.
- The rubric is fixed AT SEAL TIME, before the outcome is known — no moving goalposts — and graded against the sealed claim, never a post-hoc reinterpretation.
- Four levels, half-credit only. HIT = 1, NEAR / PARTIAL = 0.5, MISS = 0; no generous rounding, and a compound claim can lose credit on a single limb.
- Misses are kept in the denominator at FULL weight, forever. An anchored seal cannot be un-sealed, so every MISS stays on the public ledger and raises the Brier (Maharashtra 2024, the US-2024 down-ballot misses, the India-2024 markets miss — all displayed, not dropped).
- probAtSeal is the confidence the sealed words stated, never tuned up after the fact; “exceeds X%” is a floor, not a target.
- Calls softened for public safety are scored BELOW their real conviction — a terror warning understated for safety is graded at the lower number, penalising caution.
- The luck test is set entirely in the skeptic's favour: correlated calls clustered into one event, an event counts only if EVERY call inside grades HIT, and the per-event prior is floored at a coin flip (0.5).
- The record refuses free wins: correct calls are dropped on principle (the Mamdani frontrunner call as too easy, prevention-paradox calls as unfalsifiable, ethics-declined calls kept out even where they would grade HIT). Selection removes flattering material, not unflattering.
- The ceiling is conceded, first: the paper states plainly that on this small, high-confidence sample a base-rate baseline ties the aggregate Brier.
- Grades are self-assigned and say so, graded against the operator — RU-020 is PARTIAL, not HIT, despite four Russian state voices converging on the sealed scenario, because the kinetic event did not occur.
- All of it recomputes — reproduce-paper.mjs re-derives every statistic, verify-jyotint.mjs re-checks every seal, and /regrade lets you re-score the whole corpus under your own verdicts.
The symmetric test
Take this exact rubric and apply it to RAND, Stratfor, Palantir, Recorded Future, the Good Judgment Project, Metaculus, or any IC product. The finding is not that they score badly — it is that there is no dated, named, unhedged, misses-in, per-call recomputable record to grade at all: the standard is not failed, the game is not played. That is Philip Tetlock's own documented result (Expert Political Judgment, 2005; the Good Judgment Project): forced into dated, scored accountability with retained misses, the average expert barely beats chance. So the honest comparison is never “our Brier versus theirs” — it is that this desk submits to a rubric no one else does, and publishes every miss it produces. See /math, /peers, /regrade.