# A Sealed, Bitcoin-Anchored, Self-Adjudicated Forecasting Protocol: Falsifiability, Calibration, and Information Yield over a Public Corpus of Space and Geopolitical Forecasts

**Vijay Jyotish**
Vijay Jyotish LLC
ORCID: [0009-0009-0832-9998](https://orcid.org/0009-0009-0832-9998)
Correspondence: contactus@vijayjyotish.com

*Preprint — methods note. Version of 2026-06-29. Licensed CC-BY-4.0.*
*Companion dataset: JYOTINT Sealed Forecast Corpus, DOI [10.5281/zenodo.20630257](https://doi.org/10.5281/zenodo.20630257).*

---

## Abstract

Public forecasting records are, as a class, difficult to audit. Predictions are routinely edited, deleted, retro-dated, or selectively recalled after the fact, and the absence of a tamper-evident anteriority proof makes it impossible for a third party to distinguish a forecast that was genuinely committed *before* an event from one reconstructed afterward. This note documents a protocol that addresses the anteriority and falsifiability problem directly, and that separately scores the resulting record with standard proper scoring rules. Each forecast is published verbatim to public infrastructure before the event window, canonicalized as the string `objectId|dateIssued|title|claim`, hashed with SHA-256 into a manifest, and the manifest is committed to the Bitcoin blockchain via OpenTimestamps; anyone can independently recompute the hashes and verify the time-anchor with a zero-dependency verifier. A pre-committed grading rubric (HIT / NEAR / PARTIAL / MISS) is fixed at seal time, and missed calls are retained in the denominator. Over a corpus of 68 graded forecasts (58 HIT, 3 NEAR, 3 PARTIAL, 4 MISS) the self-assigned Brier score is 0.0717 (log-loss 0.322); the launch-timing subset (n = 23) scores 0.0360. We report calibration against the IPCC AR6 calibrated-likelihood bands, and an information-theoretic statistic — Information Yield — that measures bits of surprise-if-true per call (median ≈ 8.8 bits). We state the limitations prominently and first: the probabilities and grades are **operator-assigned, not independently adjudicated**; the sample is small and high-confidence-weighted; the generative method is a disclosed unconventional one (Vedic *jyotish*). The paper makes **no claim about the validity of that mechanism**. Its contribution is narrow and checkable: the *protocol* — sealing, falsifiability, retained misses, and reproducible scoring — not a claim that any particular method works.

---

## 1. Introduction

### 1.1 The problem

A forecast is only evidence if it can be shown to have existed, in its claimed form, before the event it forecasts. Most public prediction records fail this test. Social-media posts can be edited or deleted; "I called it" claims are made after the fact with no contemporaneous artifact; ambiguous wording lets a forecaster claim credit for almost any outcome (the well-documented hindsight, vagueness, and "I-was-basically-right" biases that Tetlock and colleagues quantify in elite forecasters; see Tetlock, 2005). Even when a record is honest, a reader has no mechanical way to *audit* it: there is typically no time-stamp that a third party can verify without trusting the forecaster, no fixed scoring rule, and no guarantee that the failures were retained alongside the successes.

These are two distinct problems that are usually conflated:

1. **Anteriority and integrity** — *did this exact claim exist before the event, and is it unchanged since?* This is a cryptographic and archival question, independent of whether the claim was any good.
2. **Skill** — *given that the claims are genuinely anterior, do they carry predictive information?* This is a statistical question, answered with proper scoring rules (Brier, 1950; Gneiting & Raftery, 2007) and calibration analysis.

A credible record must dispose of (1) before (2) is even meaningful, because without (1) any apparent skill could be an artifact of post-hoc editing or selective recall.

### 1.2 Contribution

This note describes a protocol that makes both questions independently checkable by a third party, and applies it to a live corpus:

- **A cryptographic anteriority layer.** Every forecast's exact text is published to public infrastructure (X, YouTube) before its event window, then canonicalized, SHA-256–hashed into a manifest, and the manifest digest is committed to the Bitcoin blockchain via OpenTimestamps (Todd, 2016), whose lineage traces to Haber & Stornetta (1991). The block time precedes the event date.
- **A falsifiability layer.** A grading rubric is pre-committed at seal time. Outcomes are graded against the *sealed* claim, on a four-level scale, with **misses retained in the denominator**. The grades are frozen in a separately anchored ledger.
- **A scoring layer.** Standard proper scoring rules (Brier, log-loss) plus IPCC-AR6-style calibration bins, and an information-theoretic statistic (Information Yield) that measures the against-consensus surprise of each call.
- **A reproducibility layer.** A zero-dependency public verifier and a CC-BY corpus let anyone recompute every hash, the time-anchor, and the Brier from frozen inputs — removing the author from the trust chain.

We are explicit, here and throughout, that this is a *protocol* contribution. The scoring is **self-assigned** and the corpus is small; none of what follows should be read as an independently validated skill claim, nor as a claim about the validity of the underlying generative method. Section 6 develops these threats to validity in detail and is, deliberately, the most important section of the paper.

---

## 2. The sealing protocol

### 2.1 Canonical string and hash

Each forecast object is reduced to a canonical four-field string:

```
objectId|dateIssued|title|claim
```

where `objectId` is the identifier of the public post that carries the verbatim forecast (e.g. an X status ID or YouTube video ID), `dateIssued` is the publication date, `title` is the human-readable label, and `claim` is the full sealed text. This string is hashed with SHA-256 to produce a per-record digest. All per-record digests are collected into a manifest (`seal-manifest.json`), which itself has a manifest digest (`manifestHash`). The pipe-delimited template is published in the manifest header (`hashInputTemplate`), so the canonicalization is not a secret: any party can reconstruct the exact preimage from the published fields and confirm the hash.
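As a minimal sketch of the canonicalization and hashing step, assuming UTF-8 encoding and a bare pipe delimiter with no padding (the `objectId|dateIssued|title|claim` form given in the abstract), the per-record digest could be recomputed as follows. The function names and the example record are hypothetical and are not taken from the corpus or from the published verifier.

```python
import hashlib


def canonical_string(object_id: str, date_issued: str, title: str, claim: str) -> str:
    """Join the four published fields with a bare pipe and no padding."""
    return "|".join([object_id, date_issued, title, claim])


def record_digest(object_id: str, date_issued: str, title: str, claim: str) -> str:
    """SHA-256 hex digest of the canonical string (UTF-8 encoding is assumed)."""
    preimage = canonical_string(object_id, date_issued, title, claim)
    return hashlib.sha256(preimage.encode("utf-8")).hexdigest()


# Hypothetical record; none of these values come from the corpus.
record = {
    "object_id": "0000000000000000000",
    "date_issued": "2024-01-01",
    "title": "Example advisory",
    "claim": "Example sealed claim text.",
}

digest = record_digest(**record)
edited = record_digest(**{**record, "claim": "Example sealed claim text!"})

print(digest)            # 64 hex characters
print(digest == edited)  # False: a one-character edit changes the digest
```

Because the preimage is hashed byte for byte, details such as whitespace around the delimiter, Unicode normalization, and how a literal `|` inside a field is handled all affect the digest; the sketch makes no attempt to settle those choices.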

### 2.2 Anteriority versus integrity

We distinguish two properties the seal provides, because they rest on different evidence:

- **Anteriority** (the claim existed by time *t*) rests on two independent witnesses: (a) the public-platform upload timestamp of the post carrying the verbatim claim, and (b) the Bitcoin block that the manifest is committed into. The block's median-time-past precedes the forecast's event date.
- **Integrity** (the claim is unchanged since stamping) rests on the OpenTimestamps proof: the `.ots` attestation commits the manifest digest into the blockchain, so any later edit to a sealed string changes that record's SHA-256, and with it the manifest digest, and the proof no longer verifies against the edited manifest.

This separation matters for honest reporting: the seal proves *when a claim was fixed and that it has not changed* — it proves nothing about whether the claim was correct, specific, or generated by any valid process. Those are separate questions handled in §§3–4.

### 2.3 Bitcoin anchoring via OpenTimestamps

The manifest digest is timestamped with OpenTimestamps (Todd, 2016), which aggregates many digests into a Merkle tree and commits the tree root into a single Bitcoin transaction, yielding a compact, independently verifiable proof per document without a trusted third party. The design follows the digital-timestamping lineage of Haber & Stornetta (1991) and the linked/aggregated-timestamp tradition (Haber & Stornetta, 1997; Bayer, Haber & Stornetta, 1993), with the blockchain replacing the trusted timestamping authority. The forecast grades are frozen in a **separate** ledger (`grading-ledger.json`) with its **own** anchor, so that a later re-grade can never silently move the claim-anteriority proof.
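The aggregation idea can be illustrated with a generic Merkle tree. This is a sketch of the general technique only: it assumes SHA-256 at every node and duplicates the last node on odd-sized levels, which is one common convention. The real `.ots` format encodes its own sequence of operations, and this code does not attempt to reproduce it.

```python
import hashlib
from typing import List, Tuple


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def _next_level(level: List[bytes]) -> List[bytes]:
    """Hash adjacent pairs to form the parent level."""
    return [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]


def merkle_root(leaves: List[bytes]) -> bytes:
    if not leaves:
        raise ValueError("at least one leaf is required")
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # pad odd levels by repeating the last node
        level = _next_level(level)
    return level[0]


def merkle_proof(leaves: List[bytes], index: int) -> List[Tuple[str, bytes]]:
    """Sibling hashes, with their side, on the path from one leaf to the root."""
    proof = []
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])
        sibling = index ^ 1
        proof.append(("left" if sibling < index else "right", level[sibling]))
        level = _next_level(level)
        index //= 2
    return proof


def verify_proof(leaf: bytes, proof: List[Tuple[str, bytes]], root: bytes) -> bool:
    node = leaf
    for side, sibling in proof:
        node = sha256(sibling + node) if side == "left" else sha256(node + sibling)
    return node == root


# Five hypothetical document digests aggregated under one root.
digests = [sha256(f"document-{i}".encode("utf-8")) for i in range(5)]
root = merkle_root(digests)
proof = merkle_proof(digests, 3)
print(len(proof), verify_proof(digests[3], proof, root))
```

The point of the construction is that each document needs only a logarithmic number of sibling hashes to be tied to the single committed root, which is why one transaction can anchor many digests.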

### 2.4 Independent recomputation

The trust chain is designed to exclude the author. A published, dependency-free verifier (`verify-jyotint.mjs`, Node.js) lets any party fetch the live manifest, recompute the SHA-256 of every record from the four published inputs, confirm the manifest digest, and check that the `.ots` proof commits to that exact manifest. Distinct exit codes report (1) manifest-integrity drift, (2) per-record hash mismatch, (3) I/O failure, (4) an `.ots` proof that does *not* commit to the manifest (the swap/backdate handle), and (5) grading-ledger drift (frozen grades or the Brier failing to recompute). A reader who runs the verifier need not trust any claim made on the website.
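The per-record part of that check can be sketched in a few lines. This is an illustration in Python, not the published Node.js verifier: the manifest keys `records` and `sha256` are hypothetical, the numeric exit-code values simply follow the order of the list above, and the manifest-digest and `.ots` checks are omitted because their exact construction is not specified here.

```python
import hashlib
from enum import IntEnum


class ExitCode(IntEnum):
    # Numbering follows the order of the list above; the real values are assumed.
    OK = 0
    MANIFEST_DRIFT = 1
    RECORD_MISMATCH = 2
    IO_FAILURE = 3
    OTS_NOT_COMMITTED = 4
    LEDGER_DRIFT = 5


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def preimage(rec: dict) -> str:
    return "|".join([rec["objectId"], rec["dateIssued"], rec["title"], rec["claim"]])


def check_records(manifest: dict) -> ExitCode:
    """Recompute each record digest from its four inputs and compare."""
    for rec in manifest["records"]:  # "records" and "sha256" are hypothetical keys
        if sha256_hex(preimage(rec)) != rec["sha256"]:
            return ExitCode.RECORD_MISMATCH
    return ExitCode.OK


def seal(object_id: str, date_issued: str, title: str, claim: str) -> dict:
    """Build a hypothetical sealed record for the demonstration."""
    rec = {"objectId": object_id, "dateIssued": date_issued, "title": title, "claim": claim}
    rec["sha256"] = sha256_hex(preimage(rec))
    return rec


manifest = {
    "records": [
        seal("id-1", "2024-01-01", "Example A", "Claim A."),
        seal("id-2", "2024-02-01", "Example B", "Claim B."),
    ]
}
clean = check_records(manifest)
manifest["records"][0]["claim"] = "Claim A, edited after sealing."
drifted = check_records(manifest)
print(clean.name, drifted.name)  # OK RECORD_MISMATCH
```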

---

## 3. Grading

### 3.1 Pre-committed rubric, fixed at seal time

Each forecast is scored on a four-level outcome scale, with the pass/fail criteria committed at the time of sealing against the *sealed wording*:

| Verdict  | Outcome value | Meaning |
|----------|:-------------:|---------|
| **HIT**     | 1.0 | The sealed claim's named outcome/mechanism was realized. |
| **NEAR**    | 0.5 | Right dynamics, materially different realized cause or margin. |
| **PARTIAL** | 0.5 | Part of a compound claim realized; part not. |
| **MISS**    | 0.0 | The sealed claim's outcome was not realized. |

Grading is performed against the claim as sealed in `seal-manifest.json`, not against a charitably re-read version. Where a forecast's sealed mechanism is not externally named by a primary source, the call is graded conservatively (e.g. a launch slip attributed by the agency to a cause other than the sealed one is graded NEAR, not HIT).

### 3.2 Misses retained in the denominator

The single most important integrity rule is that **failures are not removed**. The corpus contains 4 graded MISSes (3 in the U.S.-2024 state-level set, 1 in the Maharashtra-2024 set), and these contribute their full Brier penalty. No call is retired, re-scoped, or down-weighted after a bad outcome. The published example most often cited internally as the discipline check is the Maharashtra-2024 MISS (`IA-MH24-001`), which was sealed at p = 0.78 and graded 0, contributing a Brier term of (0.78 − 0)² ≈ 0.608.

### 3.3 Frozen grading ledger

After grading, each call's `probAtSeal` (the probability mapped from its sealed confidence wording onto the IPCC band ladder), `ipccBand`, `outcome`, `outcomeValue`, and `brierTerm = (probAtSeal − outcomeValue)²` are frozen into `grading-ledger.json`. The corpus Brier is the mean of the frozen `brierTerm` values, so it is exactly reproducible from the ledger (`brierFromLedger` = `brier` = 0.0717). The ledger is canonical-JSON hashed (`ledgerHash`) and separately time-anchored, so a re-grade is itself a dated, visible event rather than a silent overwrite.
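A minimal sketch of the ledger arithmetic follows. The field names `probAtSeal`, `outcome`, `outcomeValue`, and `brierTerm` are those named above; the `id` key, the two example rows, and the particular JSON canonicalization (sorted keys, compact separators) are assumptions made for illustration.

```python
import hashlib
import json

OUTCOME_VALUE = {"HIT": 1.0, "NEAR": 0.5, "PARTIAL": 0.5, "MISS": 0.0}


def freeze_row(call_id: str, prob_at_seal: float, outcome: str) -> dict:
    value = OUTCOME_VALUE[outcome]
    return {
        "id": call_id,  # hypothetical key name
        "probAtSeal": prob_at_seal,
        "outcome": outcome,
        "outcomeValue": value,
        "brierTerm": (prob_at_seal - value) ** 2,
    }


def brier_from_ledger(rows: list) -> float:
    """Corpus Brier: the mean of the frozen per-call terms."""
    return sum(row["brierTerm"] for row in rows) / len(rows)


def ledger_hash(rows: list) -> str:
    # One possible canonical JSON: sorted keys, no insignificant whitespace.
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


rows = [
    freeze_row("IA-MH24-001", 0.78, "MISS"),   # the miss discussed in §3.2
    freeze_row("EXAMPLE-HIT", 0.95, "HIT"),    # hypothetical row
    freeze_row("EXAMPLE-NEAR", 0.80, "NEAR"),  # hypothetical row
]
before = ledger_hash(rows)
regraded = rows[:2] + [freeze_row("EXAMPLE-NEAR", 0.80, "HIT")]

print(round(rows[0]["brierTerm"], 4))     # 0.6084
print(round(brier_from_ledger(rows), 4))  # 0.2336
print(before == ledger_hash(regraded))    # False: a re-grade changes the hash
```

Floating-point values serialize differently across languages, so a cross-language verifier would need a fixed number format; RFC 8785 is one published canonicalization scheme, though whether the ledger follows it is not stated here.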

A clarification of scope: `probAtSeal` is *audited*, meaning it is reconciled post-event to the contemporaneous sealed text using a fixed linguistic ladder (a flat "will/cannot" → 0.95; "likely" → 0.78; a genuine "could/might" hedge → 0.60; launch calls mapped to their stated seal-time confidence), applied identically to hits and misses. "Audited" denotes this reconciliation to the sealed wording — **not** blind third-party verification.
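The ladder can be written down as a small lookup. In this sketch the three probability values are the ones stated above, while the function name, the single-term input, and the treatment of an explicitly stated confidence are assumptions; how a confidence term is extracted from a full sealed sentence is not modeled.

```python
from typing import Optional

# Probability values as stated in the text; the key spelling is illustrative.
LADDER = {
    "will": 0.95,
    "cannot": 0.95,
    "likely": 0.78,
    "could": 0.60,
    "might": 0.60,
}


def prob_at_seal(confidence_term: str, stated_confidence: Optional[float] = None) -> float:
    """Map sealed confidence wording to a probability.

    A stated seal-time confidence (as for launch calls) takes precedence
    over the linguistic ladder.
    """
    if stated_confidence is not None:
        if not 0.0 < stated_confidence < 1.0:
            raise ValueError("stated confidence must lie strictly between 0 and 1")
        return stated_confidence
    key = confidence_term.strip().lower()
    if key not in LADDER:
        raise ValueError(f"no ladder entry for {confidence_term!r}")
    return LADDER[key]


print(prob_at_seal("Likely"))            # 0.78
print(prob_at_seal("might"))             # 0.6
print(prob_at_seal("will", 0.90))        # 0.9
```

Applying one fixed table to every call, hit or miss, is what keeps the mapping from becoming a second place where hindsight can enter.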

---

## 4. Metrics

All metrics below are computed from the same frozen, published inputs and are recomputable by a reader.

### 4.1 Brier score

For binary-resolved forecasts the Brier score (Brier, 1950) is the mean squared error between the forecast probability *pᵢ* and the realized outcome *oᵢ*:

$$ \mathrm{BS} = \frac{1}{N}\sum_{i=1}^{N}(p_i - o_i)^2 $$

with partial outcomes encoded as *oᵢ* = 0.5 (NEAR / PARTIAL). Lower is better; 0 is perfect, and 0.25 is what an always-50/50 forecaster scores on strictly binary outcomes. Over the full corpus (N = 68) the self-assigned Brier is **0.0717**.

**Baselines and the skill caveat.** We report the Brier against reference points rather than as a bare number: an always-50/50 forecaster scores 0.25; a uniform prior over the outcome partition scores 0.20; published GJP "superforecaster" performance is ≈ 0.149 (Tetlock & Gardner, 2015; Mellers et al., 2014). **However**, on a small, high-confidence-weighted, operator-selected sample, a Brier in this range can be *matched by a naive base-rate baseline*, so 0.0717 is reported as a property of this record, **not** as a demonstration of calibration skill (see §6.2). A proper-scoring framing (Gneiting & Raftery, 2007) is what makes the number meaningful at all: because the Brier is strictly proper, an honest forecaster minimizes expected loss only by reporting true beliefs, so the rule cannot be gamed by hedging — but it also cannot, on its own, separate skill from a favorable base rate on a small sample.
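The base-rate caveat can be made concrete from the published verdict mix alone. The sketch below scores a constant forecaster who issues the corpus's own mean outcome for every call. This is an in-sample baseline chosen for illustration: it uses hindsight that no forecaster had at seal time, so it is a demanding reference point rather than a realistic competitor. The skill-score form `1 − BS / BS_ref` is the standard Brier skill score; the variable names are hypothetical.

```python
OUTCOME_VALUE = {"HIT": 1.0, "NEAR": 0.5, "PARTIAL": 0.5, "MISS": 0.0}
COUNTS = {"HIT": 58, "NEAR": 3, "PARTIAL": 3, "MISS": 4}  # verdict mix of the corpus
CORPUS_BRIER = 0.0717                                     # self-assigned headline value


def constant_forecast_brier(p: float, counts: dict) -> float:
    """Brier score of a forecaster who issues the same p for every call."""
    n = sum(counts.values())
    return sum(c * (p - OUTCOME_VALUE[v]) ** 2 for v, c in counts.items()) / n


n_calls = sum(COUNTS.values())
base_rate = sum(c * OUTCOME_VALUE[v] for v, c in COUNTS.items()) / n_calls
baseline = constant_forecast_brier(base_rate, COUNTS)
coin_flip = constant_forecast_brier(0.5, COUNTS)
skill_vs_baseline = 1.0 - CORPUS_BRIER / baseline

print(n_calls, round(base_rate, 4))  # 68 0.8971
print(round(baseline, 4))            # 0.0703
print(round(coin_flip, 4))           # 0.2279
print(round(skill_vs_baseline, 3))   # -0.02
```

Under these assumptions the constant in-sample forecaster scores about 0.0703, marginally better than 0.0717. Note also that with NEAR and PARTIAL coded as 0.5, an always-50/50 forecaster scores about 0.228 on this verdict mix rather than the binary 0.25.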

### 4.2 Log-loss

We additionally report the logarithmic score (the negative log-likelihood, also strictly proper; Good, 1952):

$$ \mathrm{LogLoss} = -\frac{1}{N}\sum_{i=1}^{N}\big[o_i\ln p_i + (1-o_i)\ln(1-p_i)\big] $$

Over the corpus this is **0.322**. Log-loss penalizes confident errors more harshly than the Brier and is reported alongside it as a second, complementary proper-scoring view computed from the same frozen inputs.
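To show how differently the two rules treat a confident error, the sketch below evaluates both per-call terms on a few cases. Only the first miss (p = 0.78, outcome 0) corresponds to a call discussed in this paper; the other rows are hypothetical, and clipping *p* away from 0 and 1 is an assumption added to keep the logarithm finite.

```python
import math


def brier_term(p: float, o: float) -> float:
    return (p - o) ** 2


def log_loss_term(p: float, o: float, eps: float = 1e-12) -> float:
    """Per-call logarithmic score; p is clipped away from 0 and 1 (an assumption)."""
    p = min(max(p, eps), 1.0 - eps)
    return -(o * math.log(p) + (1.0 - o) * math.log(1.0 - p))


cases = [
    ("confident hit", 0.95, 1.0),   # hypothetical
    ("likely miss", 0.78, 0.0),     # the p and outcome of IA-MH24-001
    ("confident miss", 0.95, 0.0),  # hypothetical
    ("likely near", 0.80, 0.5),     # hypothetical
]
for label, p, o in cases:
    print(f"{label:15s} brier={brier_term(p, o):.4f} logloss={log_loss_term(p, o):.4f}")
```

One consequence of coding NEAR and PARTIAL as 0.5 is that their log-loss term cannot fall below ln 2 ≈ 0.693, reached at *p* = 0.5, so fractional outcomes place a floor under the corpus log-loss that has no counterpart in the Brier score.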

### 4.3 Calibration: IPCC-AR6 reliability bins

Forecast probabilities are mapped onto the IPCC calibrated-likelihood vocabulary (Mastrandrea et al., 2010, the AR5/AR6 guidance note; IPCC, 2021), partitioned here into disjoint bins: *Virtually certain* (≥ 0.99), *Extremely likely* (0.95–0.99), *Very likely* (0.90–0.95), *Likely* (0.66–0.90), *More likely than not* (0.50–0.66), and the symmetric lower bands. Within each band we report the count, mean forecast probability (`meanP`), and mean realized outcome (`meanO`); a well-calibrated forecaster has `meanO` ≈ `meanP` in each band, which is the reliability component in Murphy's (1973) partition of the Brier score. The occupied bands of the corpus are:

| IPCC band | Range | n | mean *p* | mean outcome |
|-----------|-------|:-:|:--------:|:------------:|
| Virtually certain | 0.99–1.00 | 3 | 0.99 | 1.00 |
| Extremely likely | 0.95–0.99 | 8 | 0.95 | 1.00 |
| Very likely | 0.90–0.95 | 17 | 0.90 | 0.97 |
| Likely | 0.66–0.90 | 37 | 0.80 | 0.82 |
| More likely than not | 0.50–0.66 | 3 | 0.60 | 1.00 |

A direct and disclosed limitation is visible in the table itself: **the entire corpus sits in the upper half of the probability scale.** There are zero forecasts in the *About as likely as not* band or any band below it. This is a structural feature of an operator who seals only high-confidence calls, and it limits what the reliability diagram can establish (§6.3).
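The binned table is enough to compute a reliability-style summary. The sketch below takes the five occupied rows as printed (two-decimal values, so the result is approximate), assigns bands with inclusive lower edges (an assumption about the edge convention), and evaluates the count-weighted squared gap between mean forecast and mean outcome.

```python
# (band, n, mean forecast p, mean outcome) copied from the table above.
BANDS = [
    ("Virtually certain", 3, 0.99, 1.00),
    ("Extremely likely", 8, 0.95, 1.00),
    ("Very likely", 17, 0.90, 0.97),
    ("Likely", 37, 0.80, 0.82),
    ("More likely than not", 3, 0.60, 1.00),
]


def ipcc_band(p: float) -> str:
    """Assign a probability to a band; lower edges are treated as inclusive."""
    for name, lower in [
        ("Virtually certain", 0.99),
        ("Extremely likely", 0.95),
        ("Very likely", 0.90),
        ("Likely", 0.66),
        ("More likely than not", 0.50),
    ]:
        if p >= lower:
            return name
    return "below 0.50 (unoccupied in this corpus)"


n_total = sum(n for _, n, _, _ in BANDS)
mean_p = sum(n * p for _, n, p, _ in BANDS) / n_total
mean_o = sum(n * o for _, n, _, o in BANDS) / n_total
reliability = sum(n * (p - o) ** 2 for _, n, p, o in BANDS) / n_total

print(n_total)                # 68
print(round(mean_p, 3))       # 0.842
print(round(mean_o, 3))       # 0.895
print(round(reliability, 4))  # 0.0088
```

On these rounded inputs the mean outcome exceeds the mean forecast, and most of the reliability term comes from the three-call *More likely than not* band, which is too small a cell to support any conclusion on its own.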

### 4.4 Information Yield (bits of surprise-if-true)

Calibration (§4.3) measures whether the probabilities are honest; it says nothing about whether the calls are *informative*. A forecaster who only ever predicts the consensus favorite can be perfectly calibrated and yet carry no information over "what everyone already knew." To separate these, we define **Information Yield (IY)**, an information-theoretic statistic measuring the against-consensus surprise of a call.

For a call whose realized outcome was priced by the consensus/base rate at 1-in-*N*, the **surprise-if-true** is the self-information

$$ \mathrm{bits} = \log_2(N) $$

A consensus or near-certain call (*N* ≈ 1) carries ≈ 0 bits by construction; a base-rate forecaster or consensus-follower scores **zero bits** — which is the point. Bits are capped at 1-in-a-million (≈ 19.93 bits) and are **not** banked beyond that ceiling. Bits *earned* on a settled call are surprise × outcome-credit (HIT 1.0; NEAR/PARTIAL 0.5; MISS 0).

Over the 68 graded calls the **median surprise-if-true is ≈ 8.8 bits/call** (the typical sealed call named an outcome the consensus priced near 1-in-480), and the record earned **≈ 94%** of the available information (full credit for HITs and half credit for NEARs and PARTIALs, against zero credit for the 4 MISSes). The launch subset is the most informative slice (median ≈ 19.9 bits); the intel subset is lower (median ≈ 7 bits). The boldest sealed HITs sit at the 1-in-a-million cap (≈ 19.9 bits each).

Two deliberate restraints preserve honesty: (a) the headline is a per-call **median**, never a compounded total or Bayes factor: because the priors are self-assigned and the calls are **not** independent, summing log-bits would imply indefensible cumulative odds; (b) the priors (the 1-in-*N* values) are operator-assigned, the same caveat as the Brier. IY measures *evidence that the method beat chance on these calls* under those priors, never calibration skill, and its value changes if the priors are revised.
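The definitions in this subsection translate directly into code. In the sketch below the cap, the credit values, and the per-call median follow the text; the four example calls are hypothetical, and forming the "fraction earned" as total earned bits over total available bits is an assumption about how that ratio is built.

```python
import math
from statistics import median

CAP_N = 1_000_000  # 1-in-a-million ceiling, about 19.93 bits
CREDIT = {"HIT": 1.0, "NEAR": 0.5, "PARTIAL": 0.5, "MISS": 0.0}


def surprise_bits(one_in_n: float) -> float:
    """Self-information log2(N) of an outcome priced at 1-in-N, capped at CAP_N."""
    if one_in_n < 1:
        raise ValueError("N must be at least 1")
    return math.log2(min(one_in_n, CAP_N))


def earned_bits(one_in_n: float, verdict: str) -> float:
    """Surprise multiplied by the outcome credit of the verdict."""
    return surprise_bits(one_in_n) * CREDIT[verdict]


# Hypothetical calls as (1-in-N prior, verdict) pairs.
calls = [(1, "HIT"), (480, "HIT"), (480, "MISS"), (5_000_000, "NEAR")]

available = [surprise_bits(n) for n, _ in calls]
earned = [earned_bits(n, v) for n, v in calls]

print(round(surprise_bits(480), 2))        # 8.91
print(round(surprise_bits(5_000_000), 2))  # 19.93 (capped)
print(round(median(available), 2))         # 8.91
print(round(sum(earned) / sum(available), 2))
```

The consensus call (N = 1) contributes zero bits whether or not it resolves as a HIT, which is the property that separates this statistic from calibration.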

### 4.5 SITA (decision-value)

A separate, partly rubric-judged index, SITA, scores the *decision-value* of a call on four axes — **S**pecificity (the who/what/where/when/why/how/recommendation vectors; computed), **I**mprobability (the published 1-in-*N*; computed), impac**T** (consequence/stakes; tiered from the call's own impact block), and **A**ctionability (was a recommendation named, was there lead time, a crisp go/no-go, a still-changeable outcome; derived). The composite is a transparent weighted mean (rigor axes S+I = 40%, value axes T+A = 60%) on 0–100; the strongest calls reach a composite of 96. SITA is reported here for completeness and to make the distinction explicit: IY measures *how much a skeptic's belief should move*; SITA measures *how much a decision-maker should care*. Because two of SITA's axes are rubric-judged, it is the least objective of the metrics and is treated as descriptive, not evidential.
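For concreteness, a composite of this shape can be sketched as below. The 40% / 60% group weights are those stated above; splitting each group equally between its two axes is an assumption, as are the function name and the example scores.

```python
# Group weights from the text: S + I = 40%, T + A = 60%.
# The equal split within each group is an assumption made for this sketch.
WEIGHTS = {"S": 0.20, "I": 0.20, "T": 0.30, "A": 0.30}


def sita_composite(scores: dict) -> float:
    """Weighted mean of four axis scores, each on a 0-100 scale."""
    for axis in WEIGHTS:
        if not 0.0 <= scores[axis] <= 100.0:
            raise ValueError(f"axis {axis} must lie in [0, 100]")
    return sum(WEIGHTS[axis] * scores[axis] for axis in WEIGHTS)


# Hypothetical axis scores, not taken from any sealed call.
example = {"S": 90.0, "I": 100.0, "T": 100.0, "A": 95.0}
print(round(sita_composite(example), 1))  # 96.5
```

Because the value axes carry the larger weight, a call with modest specificity can still reach a high composite, which is one reason the index is treated as descriptive.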

---

## 5. Results

### 5.1 Corpus

The graded corpus comprises **68 closed forecasts** spanning two domains: launch-timing advisories (LA-series, n = 23) and geopolitical/intelligence advisories (IA-series — Russia–Ukraine, U.S.-2024, India-2024, and others, n = 45). The verdict mix is:

| Verdict | Count |
|---------|:-----:|
| HIT | 58 |
| NEAR | 3 |
| PARTIAL | 3 |
| MISS | 4 |
| **Total closed** | **68** |

The seal manifest contains 78 records in total; the remaining 10 are sealed-but-deliberately-ungraded items (e.g. dated readings excluded on ethical grounds, or calls whose mechanism is not externally named) that are kept out of the scored denominator and clearly flagged.

### 5.2 Headline metrics

| Metric | Value | Note |
|--------|:-----:|------|
| Brier (full corpus, N = 68) | **0.0717** | self-assigned; see §6.2 |
| Brier (launch subset, n = 23) | **0.0360** | LA-series only |
| Brier (intel subset, n = 45) | 0.0899 | IA-series only |
| Log-loss (full corpus) | 0.322 | second proper score |
| IY median surprise-if-true | ≈ 8.8 bits/call | ≈ 1-in-480 |
| IY information earned | ≈ 94% | of available bits |

The launch-timing subset scores better (0.0360) than the geopolitical subset (0.0899). Two factors are consistent with this gap: all four MISSes fall in the geopolitical subset (§3.2), and launch outcomes are externally and unambiguously adjudicated (an FAA mishap finding, a NASA notice, the provider's own root-cause statement, an independent review board), whereas geopolitical outcomes are noisier and harder to resolve cleanly.
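As a consistency check on the table above, the full-corpus Brier should equal the count-weighted mean of the two subset scores, since each is a mean of per-call terms over disjoint sets. Using the rounded values as printed:

$$ \mathrm{BS}_{\text{full}} = \frac{23 \times 0.0360 + 45 \times 0.0899}{68} = \frac{0.8280 + 4.0455}{68} \approx 0.0717 $$

which agrees with the headline figure to the reported precision.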

### 5.3 Externally-adjudicated subset

A subset of launch calls has outcomes named by a *primary external source*, so the verdict is mechanical against that named outcome rather than author-adjudicated. Examples include Starship IFT-2 (both stages lost; FAA mishap closure accepting SpaceX root causes; HIT), Vulcan/Peregrine (Vulcan nominal, Peregrine lost to a propulsion-valve over-pressure per the Astrobotic independent review board; HIT), and New Glenn NG-3 (booster recovered, second-stage underperformance and payload loss, FAA grounding per Blue Origin and FAA; HIT). This externally-adjudicated subset is the strongest part of the record because, for these rows, the *outcome* does not depend on the author — only the anteriority and the grade rubric do.

---

## 6. Limitations and threats to validity

This section is deliberately placed before the conclusion and is the most important part of the paper. The reader should weight it accordingly.

### 6.1 The scoring is self-assigned, not independently adjudicated

The Brier, the log-loss, the calibration bins, and the IY priors are all computed from **operator-assigned** probabilities and **operator-assigned** grades. There is no independent adjudicator, no pre-registration on a third-party platform (e.g. a public forecasting tournament), and no blinded scorer. The grading *rubric* is pre-committed and the *anteriority* of each claim is cryptographically provable by a third party — but the probability attached to each claim, and the HIT/NEAR/PARTIAL/MISS verdict, are the author's. A skeptical reader should treat every score in §§4–5 as *self-reported under a fixed, public, recomputable procedure*, which is stronger than an unstructured track record but weaker than independent adjudication.

### 6.2 The Brier is not, by itself, a skill claim

On a small, high-confidence-weighted, operator-*selected* sample, a Brier of 0.0717 can be matched or approached by a naive base-rate baseline. The number is therefore reported as a descriptive property of this corpus, **not** as evidence of calibration skill. (This is exactly why the IY statistic exists and why a base rate scores zero bits there: calibration and informativeness are different axes, and the base-rate objection that weakens the Brier does not transfer to IY — but IY's priors are themselves self-assigned, §4.4.)

### 6.3 Selection, survivorship, and range restriction

Three related selection threats apply. (a) **Selection**: the operator chooses which forecasts to seal; calls the operator is unsure about may simply not be sealed, biasing the sample toward confident, resolvable calls. (b) **Survivorship**: only closed calls are scored; the relationship between sealed-but-open calls and the closed set is not characterized. (c) **Range restriction**: as the calibration table shows, the entire corpus lives in the upper half of the probability scale (≥ 0.50), with nothing in the lower bands, so the reliability diagram cannot establish calibration across the full probability range — only within the high-confidence region the operator actually occupies.

### 6.4 Small N

With N = 68 closed calls (and a smaller externally-adjudicated subset), confidence intervals on any aggregate metric are wide, and a handful of additional misses would move the Brier materially. Conclusions should be read as preliminary and as a function of corpus growth.
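The sensitivity to additional misses can be quantified from the headline numbers. The sketch below adds *k* hypothetical MISSes to the existing 68 calls, each sealed at p = 0.80 (the mean forecast of the *Likely* band, chosen here only as a representative value), and recomputes the mean.

```python
def brier_after_extra_misses(brier: float, n: int, extra: int, p: float) -> float:
    """Corpus Brier after adding `extra` misses, each sealed at probability p.

    A miss has outcome 0, so each one adds a term of p squared.
    """
    return (brier * n + extra * p ** 2) / (n + extra)


CORPUS_BRIER, N_CALLS = 0.0717, 68
P_HYPOTHETICAL = 0.80  # assumed seal probability of each added miss

scenario = {
    k: round(brier_after_extra_misses(CORPUS_BRIER, N_CALLS, k, P_HYPOTHETICAL), 4)
    for k in range(6)
}
for k, value in scenario.items():
    print(k, value)
```

Under these assumptions three further misses would raise the corpus Brier from 0.0717 to roughly 0.096, and five would take it above 0.11, which illustrates how few outcomes separate the current figure from a materially different one.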

### 6.5 The generative method is unconventional and its mechanism is not validated here

The forecasts are generated by a disclosed, unconventional method (Vedic *jyotish*). This paper makes **no claim whatsoever about the validity of that mechanism**, offers no causal account, and should not be read as evidence for it. The method is disclosed for honesty but is *not* the object of study. The object of study is the *protocol* — sealing, falsifiability, retained misses, reproducible scoring — which is method-agnostic and could be applied to any forecaster regardless of how their forecasts are produced. A reader who rejects the generative method entirely can still evaluate the protocol's claims (anteriority, integrity, retained misses, recomputable scores) on their own terms.

### 6.6 Positioning and pedigree

The record is positioned as an **augmenting input**, never as a decision authority, and explicitly **not** as a substitute for established analysis. **No national-security pedigree, no NSSL/flight-program standing, and no institutional adjudication are claimed.** Any reading of the corpus as carrying agency endorsement or operational standing would be a misreading; none exists.

### 6.7 Grade defensibility

NEAR and PARTIAL are inherently judgment calls, and a different adjudicator could re-grade individual rows. The mitigations are that grades are committed against the *sealed* wording, frozen in a hash-anchored ledger, and conservative by rule where the sealed mechanism is not externally named — but the residual subjectivity is real and is the reason the externally-adjudicated subset (§5.3) is highlighted separately.

---

## 7. Reproducibility

The protocol is built so that a reader can reproduce every quantitative claim without trusting the author.

- **Zero-dependency verifier.** `verify-jyotint.mjs` (Node.js, no external dependencies) fetches the live manifest, recomputes the SHA-256 of every record from the four published inputs (`objectId|dateIssued|title|claim`), confirms the manifest digest, verifies that the OpenTimestamps `.ots` proof commits to that manifest, and recomputes the Brier from the frozen grading ledger. It returns distinct non-zero exit codes for manifest drift, per-record hash mismatch, I/O failure, an `.ots` proof that does not commit to the manifest, and grading-ledger drift.
- **Open corpus.** The forecasts, hashes, probabilities, and grades are released as the *JYOTINT Sealed Forecast Corpus* under CC-BY-4.0, with a machine-readable `CITATION.cff` and a Zenodo concept DOI ([10.5281/zenodo.20630257](https://doi.org/10.5281/zenodo.20630257)) that resolves to the latest version.
- **Machine-readable API.** A CC-BY REST API exposes the index, the graded set, per-advisory records, the calibration/integrity data, the SITA distribution, and the Information-Yield series (`/api/v1/...`), alongside the static `seal-manifest.json`, `grading-ledger.json`, and `calibration.json` artifacts from which all numbers in this paper are computed.
- **Recomputation in practice.** Because the canonicalization template, the hashing algorithm (SHA-256), the grading rule (`brierTerm = (probAtSeal − outcomeValue)²`), and the frozen inputs are all public, the entire pipeline from sealed text → hash → manifest → Brier is reproducible by a third party with standard tools.

---

## 8. Conclusion

We have described a forecasting protocol that separates, and independently makes checkable, the two questions a forecasting record must answer: *did the claim exist, unaltered, before the event?* and *does the record carry information?* The first is addressed cryptographically — canonical-string SHA-256 hashing, a public manifest, an OpenTimestamps Bitcoin anchor, and a zero-dependency verifier that removes the author from the trust chain. The second is addressed with standard proper scoring rules (Brier 0.0717; log-loss 0.322), IPCC-AR6 calibration bins, and an information-theoretic statistic (Information Yield, median ≈ 8.8 bits/call) — with misses retained in the denominator and a frozen, separately anchored grading ledger.

The contribution is the *protocol*, not a skill claim and not a claim about any generative method. The scoring is self-assigned, the sample is small and high-confidence-weighted, and the calibration evidence is confined to the upper half of the probability scale; we have stated these limits plainly and first. What the protocol does establish, and what we believe is portable to other forecasters and other domains, is a falsifiable, tamper-evident, reproducibly scored *form* for a public forecasting record — one in which a third party can verify anteriority, confirm integrity, recompute the scores, and see the failures alongside the successes, without taking the forecaster's word for any of it.

---

## References

Bayer, D., Haber, S., & Stornetta, W. S. (1993). Improving the efficiency and reliability of digital time-stamping. In *Sequences II: Methods in Communication, Security, and Computer Science* (pp. 329–334). Springer. https://doi.org/10.1007/978-1-4613-9323-8_24

Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. *Monthly Weather Review*, 78(1), 1–3. https://doi.org/10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2

Gneiting, T., & Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. *Journal of the American Statistical Association*, 102(477), 359–378. https://doi.org/10.1198/016214506000001437

Good, I. J. (1952). Rational decisions. *Journal of the Royal Statistical Society: Series B (Methodological)*, 14(1), 107–114. https://doi.org/10.1111/j.2517-6161.1952.tb00104.x

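Haber, S., & Stornetta, W. S. (1991). How to time-stamp a digital document. *Journal of Cryptology*, 3(2), 99–111. https://doi.org/10.1007/BF00196791

Haber, S., & Stornetta, W. S. (1997). Secure names for bit-strings. In *Proceedings of the 4th ACM Conference on Computer and Communications Security* (pp. 28–35). ACM.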

Intergovernmental Panel on Climate Change (IPCC). (2021). *Climate Change 2021: The Physical Science Basis. Contribution of Working Group I to the Sixth Assessment Report of the Intergovernmental Panel on Climate Change* (V. Masson-Delmotte et al., Eds.). Cambridge University Press. https://doi.org/10.1017/9781009157896

Mastrandrea, M. D., Field, C. B., Stocker, T. F., Edenhofer, O., Ebi, K. L., Frame, D. J., Held, H., Kriegler, E., Mach, K. J., Matschoss, P. R., Plattner, G.-K., Yohe, G. W., & Zwiers, F. W. (2010). *Guidance Note for Lead Authors of the IPCC Fifth Assessment Report on Consistent Treatment of Uncertainties*. Intergovernmental Panel on Climate Change. https://www.ipcc.ch/site/assets/uploads/2017/08/AR5_Uncertainty_Guidance_Note.pdf

Mellers, B., Ungar, L., Baron, J., Ramos, J., Gurcay, B., Fincher, K., Scott, S. E., Moore, D., Atanasov, P., Swift, S. A., Murray, T., Stone, E., & Tetlock, P. E. (2014). Psychological strategies for winning a geopolitical forecasting tournament. *Psychological Science*, 25(5), 1106–1115. https://doi.org/10.1177/0956797614524255

Murphy, A. H. (1973). A new vector partition of the probability score. *Journal of Applied Meteorology*, 12(4), 595–600. https://doi.org/10.1175/1520-0450(1973)012<0595:ANVPOT>2.0.CO;2

Tetlock, P. E. (2005). *Expert Political Judgment: How Good Is It? How Can We Know?* Princeton University Press.

Tetlock, P. E., & Gardner, D. (2015). *Superforecasting: The Art and Science of Prediction*. Crown Publishers.

Todd, P. (2016). *OpenTimestamps: A timestamping proof standard*. https://opentimestamps.org/

---

*Author: Vijay Jyotish, Vijay Jyotish LLC. ORCID 0009-0009-0832-9998. Companion dataset DOI [10.5281/zenodo.20630257](https://doi.org/10.5281/zenodo.20630257). This preprint is released under the Creative Commons Attribution 4.0 International License (CC-BY-4.0).*
