Why We Use CRPS Instead of Win Rate (And Why It Matters)
If you follow prediction markets or sports analytics, you've probably seen services advertise their "win rate" — "78% accuracy!" or "9 out of 10 correct!" On the surface, these numbers sound impressive. But win rate is actually a terrible metric for evaluating probability forecasters, and here's why.
The Problem with Win/Loss
Imagine two forecasters evaluating the same market: "Will Bitcoin exceed $100K by March 2026?"
Forecaster A says: 51% YES. Forecaster B says: 92% YES. The event happens. Both forecasters are "correct," and each record improves by exactly one win.
But clearly Forecaster B made a better prediction. They were more confident and more accurate. A 51% forecast is barely better than flipping a coin — it conveys almost no useful information. Yet under a win/loss framework, both predictions are treated identically.
This is the fundamental flaw: win rate rewards being on the right side of 50% but doesn't care how right you were. It lets a forecaster hedge, predicting 51% on everything and never really committing, while still racking up "wins" and conveying almost no information.
What CRPS Actually Measures
Continuous Ranked Probability Score (CRPS) is the gold standard in meteorology, epidemiology, and professional forecasting. It measures the distance between your predicted probability distribution and the actual outcome.
For binary events (yes/no), CRPS reduces to a clean formula: the squared difference between your probability and the actual outcome, coded 0 or 1 (this is the familiar Brier score). A forecast of 90% on an event that happens scores much better than a forecast of 55% on the same event, and a forecast of 90% on an event that doesn't happen is penalized far more heavily than a 55% forecast on the same non-event.
Lower CRPS is better. A perfect forecaster, one who somehow knew each outcome in advance and predicted 0% or 100% accordingly, would score 0. A coin-flip forecaster (always predicting 50%) scores exactly 0.25 on every event.
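To make this concrete, here's the binary case in a few lines of Python. The function name is ours for illustration, but the formula is the standard binary reduction of CRPS:

```python
def crps_binary(p: float, outcome: int) -> float:
    """CRPS for a binary event: with the outcome coded 0 or 1, it
    reduces to the squared distance between the forecast probability
    and what actually happened (the Brier score)."""
    return (p - outcome) ** 2

# The Bitcoin example from above, where the event resolves YES (1):
print(crps_binary(0.51, 1))  # Forecaster A: 0.2401, barely beats a coin flip
print(crps_binary(0.92, 1))  # Forecaster B: 0.0064, far better
print(crps_binary(0.50, 1))  # Always-50% forecaster: 0.25, whatever happens
```

Note how the squaring makes overconfidence expensive: a 92% forecast on an event that fails to happen costs 0.8464, more than three times the coin-flipper's penalty.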
How We Use CRPS for Nightly Recalibration
Every night, our recalibration system processes all resolved forecasts and computes CRPS scores per category. We then use these scores to adjust category-specific calibration multipliers.
For example, if our crypto forecasts have been systematically overconfident, say predicting 80% on average for events that resolve YES only 73% of the time, the nightly process detects the pattern and adjusts the crypto calibration multiplier downward. The next day's crypto forecasts are automatically nudged toward better-calibrated levels.
This creates a continuous feedback loop: forecast → outcome → score → adjust → improved forecast. Over time, our CRPS converges toward the theoretical optimum for each category.
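Here is a heavily simplified sketch of what that nightly pass might look like. The function names, data shapes, and damped update rule are illustrative assumptions, not our production code:

```python
from collections import defaultdict

LEARNING_RATE = 0.1  # illustrative damping factor, not our real tuning

def nightly_recalibration(resolved, multipliers):
    """resolved: iterable of (category, predicted_prob, outcome) tuples,
    with outcome coded 0 or 1. multipliers: dict mapping category to its
    current calibration multiplier. Returns per-category mean CRPS and
    the updated multipliers."""
    by_category = defaultdict(list)
    for category, p, outcome in resolved:
        by_category[category].append((p, outcome))

    mean_crps = {}
    for category, pairs in by_category.items():
        n = len(pairs)
        # Per-category CRPS: for binary events, the mean squared error
        # between forecast probabilities and outcomes.
        mean_crps[category] = sum((p - o) ** 2 for p, o in pairs) / n
        # Calibration check: did we say 80% on average for events that
        # resolved YES only 73% of the time? If so, nudge downward.
        overconfidence = (sum(p for p, _ in pairs) / n
                          - sum(o for _, o in pairs) / n)
        multipliers[category] -= LEARNING_RATE * overconfidence

    return mean_crps, multipliers
```

The damping factor keeps one unlucky night from swinging a category's multiplier too far; the real adjustment logic can be more sophisticated, but this is the shape of the feedback loop described above.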
Per-Category Multipliers
Different categories have different base rates, volatilities, and difficulty levels. Politics is often well-predicted by polls and models (higher accuracy). Crypto markets are inherently volatile and harder to forecast (lower accuracy, but with larger informational edges when we're right).
Rather than applying a single global correction, we maintain separate calibration multipliers for each of our 8 categories: Politics, Crypto, Sports, Economics, Tech, Culture, Science, and Other. This ensures our forecasts are well-calibrated within each domain.
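One natural way to apply a multiplier like that is in log-odds space, so a 50% forecast stays at 50% while confidence is scaled up or down. The sketch below assumes that transform; it illustrates the idea rather than documenting the exact function we run:

```python
import math

def apply_category_multiplier(p: float, multiplier: float) -> float:
    """Scale a raw forecast's confidence by its category multiplier.
    A multiplier below 1 pulls the forecast toward 50% (tempering
    overconfidence); above 1 pushes it away from 50%. Working in
    log-odds space is an assumption chosen for illustration."""
    log_odds = math.log(p / (1 - p))
    return 1 / (1 + math.exp(-multiplier * log_odds))

# After a stretch of overconfident crypto forecasts, a multiplier of
# 0.8 tempers a raw 80% forecast down to roughly 75%:
print(round(apply_category_multiplier(0.80, 0.8), 3))  # 0.752
```

A multiplier of exactly 1 leaves forecasts untouched, which makes it a neutral starting point for a category before any resolved outcomes accumulate.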
Evaluating Us Yourself
We publish our full CRPS scores, category breakdowns, and calibration curves on our Track Record page. You can see exactly how calibrated we are across every category, including the forecasts we got wrong.
We believe transparent evaluation is essential. Any service that only shows you win rate — and hides their miscalibrated forecasts — is giving you an incomplete picture. CRPS holds us to a higher standard, and we publish the results for anyone to verify.