World vs Model World vs Model · Methodology ← BoardGlossary →

World Cup 2026 — methodology ("World vs Model")

⚠️ A research & education experiment — NOT gambling and NOT an encouragement to gamble. We hold no positions and invest no real capital. Every figure is a paper simulation of market structure, not investment returns, advice, or a solicitation. Prediction-market platforms like Polymarket are restricted or banned in some jurisdictions (e.g. Singapore) — know your local laws.

1. What we are trying to do

Beat the market at the World Cup using zero football knowledge.

We never forecast a match, a player, or a team. We take the market's own prices and ask a narrower, more honest question: are those prices internally consistent and unbiased — and if not, can we harvest the inconsistency? The entire edge, if any, comes from the structure of the prices, not from knowing anything about football.

Then we keep score in public. Because these markets resolve, every call is timestamped and graded out-of-sample against what actually happened. The point isn't to look clever — it's to find out, on the record, whether a purely structural strategy can beat a liquid crowd.

2. Why bounded (prediction) markets are interesting for this

A "bounded" market is one that settles to a known value — here, 0 or 1 (a team advances, or it doesn't). That boundedness is exactly what makes zero-knowledge strategies viable, in four ways an ordinary stock doesn't offer:

  1. Ground truth arrives. A stock never tells you its "true value." A prediction market resolves — so a forecast is falsifiable and skill is measurable, not just assertable. This is what lets us run an honest scorecard instead of a vibe.
  2. Hard mathematical constraints exist for free. Prices are probabilities. Sub-events nest (winning ⊆ reaching the final ⊆ reaching the semi …). Mutually-exclusive outcomes must sum to the number of slots. These create no-arbitrage relationships that need no domain model — pure logic flags a mispricing.
  3. There are well-documented behavioral biases. The favorite–longshot bias (longshots systematically overpriced, favorites underpriced) is one of the oldest and most-replicated anomalies in betting markets — horse racing, sports, elections, for decades. It's a shape in the price distribution we can correct without any view on a specific team.
  4. Breadth. One tournament is hundreds of correlated markets (advance / QF / SF / final / win, per team). That breadth lets us build a market-neutral long/short book instead of a single winner pick — and breadth is the cleanest lever there is (Grinold's law: information ratio ≈ skill × √breadth).

3. The market surface: a nested ladder

Our single market data source is Polymarket. We read its live, real-money World Cup prices through the public Gamma API — no private feeds, no broker. Those prices are the "world" / market side of every comparison on the board; the crowd's money is our benchmark, and we link straight out to the underlying Polymarket market on every team. (We are not affiliated with Polymarket, and — see the banner — it is restricted in some jurisdictions; the links are reference-only.) Because everything keys off one transparent, reproducible price source, there is no separate "data source" tab to maintain: it is Polymarket, end to end.

Polymarket lists the World Cup as several separately-traded events. Per team they form a nested ladder of "how far do you go," each level with a known number of slots:

RoundSlots (prices must sum to)
Advance to knockouts32
Reach quarter-final8
Reach semi-final4
Reach final2
Win the Cup1

≈240 markets vs 48 winner-only — 5× the breadth, and where the cleanest structural tells live.

4. Why we think there's an edge — and why it could work

Three independent structural sources, none of which require football knowledge:

We measure all of this against the market's own price as the baseline — skill-vs-market, the only comparison that matters. Beating a coin flip is meaningless; beating the de-vigged crowd is the whole game.

5. Sizing & the four books (2 models × 2 styles)

A fixed $1,000 paper bankroll per book, conviction-weighted (stake ∝ |edge|, capped per trade), dollar-neutral, net of the half-spread. Each ticket holds signed YES-shares: unrealized = shares × (price − entry), realized = shares × (outcome − entry). Every book is frozen at day 0 into the ledger (ledger/wc_*.jsonl) and marked to live, so each shows real running Entry → Now → PnL — not a re-sized proposal.

Two trading styles, run identically for both models so the comparison is symmetric:

Crossed with the two models that produce the edges — the zero-knowledge structural model and the informed Elo model (§5b) — that is four books, each its own tab with its own PnL: Buy & Hold, Active, Elo · Buy & Hold, Elo · Active. Comparing across the grid is itself the finding: if Active doesn't beat Buy & Hold net of churn the daily updates are noise; if the Elo books don't beat the zero-knowledge ones, football knowledge didn't help — and we'd report either honestly.

5b. The fundamental model (Elo) — an independent second opinion

The structural book above is derived from the market, so by construction it can only disagree with the crowd a little. To get a genuinely independent view we run the engine's own simulation — Elo + Poisson goals over the verified 2026 bracket — on real per-team World Football Elo ratings (eloratings.net, dated, reproducible), not seeded from Polymarket.

Four disclosed choices make it honest (all priors from football, not tuned to the market):

What it finds (live): a clean results-vs-reputation split — the Elo model rates Spain, Argentina and the in-form South-Americans (Colombia, Ecuador, Uruguay) higher than the market, and fades the reputation teams (France, England, Portugal, Germany) whose results have lagged their squad value. That's a real, defensible disagreement.

How the Elo model sizes and places its bets. The Elo book is sized exactly like the structural book — conviction-weighted and dollar-neutral: long the markets the model thinks are under-priced, short the over-priced, each side getting half a fixed $1,000 paper bankroll, stake ∝ |edge| and capped per trade so no single bet dominates. The crucial difference is where the edge comes from:

Which of its bets to trust: the win-level disagreements (results-vs-reputation) are the defensible part; the advance-level bets are larger but less reliable — the market usually knows a team's specific group context better than a single backward-looking Elo number.

The honest caveat, stated on the board itself: a model built on public ratings is usually less sharp than a liquid, heavily-traded market — so a big gap is more likely the model being cruder than the crowd being wrong. Big disagreement ≠ edge. Both books are scored by the same live scorecard; that's what adjudicates. (src/worldcup_fundamental.py.)

The most-likely outcome map. The same 20k simulations also drive a projection on the board: the projected group stage (each team's most-likely finishing rank + advance %, from the group finishing-position distribution) and a knockout pyramid (the most-likely Quarter-finalists → Semi-finalists → Finalists → Champion, read off the reach-round probabilities). These are the model's probabilities, shown in its violet colour — a picture of what the informed model expects, not a market consensus.

6. What's missing, and how we'd improve it (read this part)

We are deliberately loud about the limitations — the credibility is in the caveats.

What's missing / weak today

How we'd improve it

7. Honesty guardrails

The math — formulas & techniques

A compact, reproducible reference. Prices are probabilities \(p \in (0,1)\); each round has \(S\) slots (advance 32, QF 8, SF 4, final 2, win 1). Everything is plain arithmetic over a handful of disclosed priors — none fit to the market: the favorite–longshot \(\alpha\), the Elo shrinks \(\kappa\), the host bonus, the rating-uncertainty \(\sigma\), the Dixon–Coles \(\rho\), and the cost half-spread \(c\).

De-vig & overround. Raw prices sum to more than the slots (the bookmaker's margin); we rescale each level proportionally to its slot count:

$$q_i = \frac{p_i}{\sum_j p_j}\,S, \qquad \text{overround} = \frac{\sum_j p_j}{S} - 1 .$$

Zero-knowledge model — favorite–longshot correction. A power transform of the shape, renormalised back to the slots, with \(\alpha = 1.15\):

$$\text{model}_i = \frac{p_i^{\alpha}}{\sum_j p_j^{\alpha}}\,S, \qquad \text{edge}_i = \text{model}_i - q_i .$$

\(\alpha > 1\) lifts favorites and trims longshots — the historically profitable direction of the bias.

No-arbitrage (nested). For each team the deeper rounds can't exceed the shallower ones:

$$P(\text{win}) \le P(\text{final}) \le P(\text{SF}) \le P(\text{QF}) \le P(\text{advance}) .$$

Any \(P(\text{deeper}) > P(\text{shallower}) + \text{tol}\) is a riskless mispricing.

Cost (half-spread) netting. Every edge shown in a book is net of a flat half-spread \(c = 0.01\) (≈ 1 cent, a typical Polymarket tick):

$$\text{edge}_{\text{net}} = \operatorname{sign}(\text{edge})\cdot\max\!\big(|\text{edge}| - c,\ 0\big).$$

A gap that doesn't clear the cost to trade it is sized to zero — we only act on edges that survive the spread. (The model-vs-market comparison table still shows the gross disagreement; netting applies where money is at stake.)

Sizing — the paper book. Conviction-weighted and dollar-neutral, on the net edge. Each side (long/short) gets half a bankroll \(B\), allocated by \(|\text{edge}|\) and capped per trade:

$$\text{stake}_i = \min\!\left(\frac{B}{2}\cdot\frac{|\text{edge}_i|}{\sum_j|\text{edge}_j|},\ \ \text{cap}\right), \quad \text{cap} = \tfrac{B}{2}\cdot 0.15 ,$$

$$\text{shares}_i = \pm\,\frac{\text{stake}_i}{\max(\kappa_i,\ \varepsilon)}, \quad \kappa_i = p_i\ (\text{long})\ \text{ or }\ 1-p_i\ (\text{short}).$$

Mark-to-market \(\text{PnL} = \text{shares}\cdot(\text{price}_{\text{now}} - \text{entry})\); realised uses the settled outcome \(o \in \{0,1\}\). A binary bet can't lose more than its stake: \(\max\!\downarrow = -\,\text{stake}\), \(\max\!\uparrow = |\text{shares}| - \text{stake}\).

The informed (Elo) model. Match win probability (Elo expected score), with a home bonus \(h\), and the rating update from a result (\(S \in \{1,\tfrac12,0\}\), \(K = 24\)):

$$E = \frac{1}{1 + 10^{-(R_A - R_B + h)/400}}, \qquad R \leftarrow R + K\,\ln\!\big(|\Delta\text{goals}| + 1\big)\,(S - E).$$

Group-stage scoreline — a Dixon–Coles-corrected double Poisson (the \(\rho\) term lifts the low-score draws that independent Poisson under-produces, giving a realistic \(\sim 27\%\) draw rate):

$$g_A \sim \text{Poisson}(\lambda_A),\quad \lambda_A = \mu\,e^{+b(R_A - R_B + h)},\quad \lambda_B = \mu\,e^{-b(R_A - R_B + h)},$$

with \(\mu = 1.35\), \(b = 0.0028\), \(\rho = -0.05\); the DC factor \(\tau_\rho\) reweights the \((0\text{-}0,\ 1\text{-}0,\ 0\text{-}1,\ 1\text{-}1)\) cells.

Disclosed priors on the ratings: a host bonus of \(+60\) Elo for the three 2026 co-hosts (added before the shrink), and a knockout-only shrink toward the field mean \(\bar{R}\):

$$R' = \bar{R} + \kappa\,(R - \bar{R}), \qquad \kappa_{\text{group}} = 1.0,\quad \kappa_{\text{KO}} = 0.6.$$

Rating uncertainty. Each of the \(N\) simulations perturbs every team's rating by \(\varepsilon \sim \mathcal{N}(0,\sigma^2)\), \(\sigma = 70\) Elo (the same \(\varepsilon\) for a team's group and knockout ratings), integrating over our uncertainty about true strength — so a favorite reads \(\sim 99\%\), not a false 100%, to advance.

Bracket. Qualifiers are seeded into a balanced bracket (winners take the protected top seeds, then runners-up, then best-thirds, by rating within each tier), so winners are kept apart and same-group teams don't meet in the Round of 32 — a structural approximation, not the exact official slot table.

Monte-Carlo reach probabilities. Over \(N = 20{,}000\) sims, \(P(\text{level}) = \text{count}/N\). The Monte-Carlo error — how precisely the sim estimates its own answer — is

$$\mathrm{SE} \approx \sqrt{\frac{p(1-p)}{N}} \approx 0.2\text{–}0.3\%.$$

That is not the forecast's error bar: the real uncertainty is dominated by the inputs (the Elo ratings, the host bonus, the bracket path, and everything the model can't see — form, injuries, motivation), which is far larger. So a big model-vs-market gap is not "signal because it clears the MC noise"; treat it as a disagreement the scorecard adjudicates, not a proven edge. The rating-uncertainty band is a first, partial attempt to reflect this.

Scoring (the public scorecard). Out-of-sample, always vs the market — lower Brier is better, positive skill means the model beat the crowd:

$$\text{Brier} = \frac{1}{n}\sum_i (p_i - o_i)^2, \qquad \text{skill} = \text{Brier}_{\text{market}} - \text{Brier}_{\text{model}} .$$

Where this lives in code

PieceModule
De-vig / favorite–longshot / no-arb detectorssrc/consistency.py
Full nested ladder + per-level overround + arb scansrc/worldcup_markets.py
Paper positions, mark-to-live, resolve PnLsrc/worldcup_positions.py
Conviction-weighted sizing + the HTML boardsrc/worldcup_board.py
Two books — Buy & Hold vs Active Tradingsrc/worldcup_books.py
Bounded-market calibration / time-to-resolution signalssrc/bounded.py
Cost / capacity analysissrc/capacity.py
Pre-registered, timestamped hypothesisdocs/preregistration-worldcup.md