Strategy · 9 min read

Backtesting binary prediction market strategies honestly

Survivorship bias, look-ahead bias, and fee modeling -- the three mistakes that make backtests look better than live trading.

Why binary backtests lie

Binary prediction markets are unusual to backtest. You are not testing a continuous price strategy -- you are testing whether a signal predicts a discrete 0/1 outcome within a fixed window. The opportunities for self-deception are plentiful, and most backtests I have seen overstate edge by 2-5x.

Here are the three main failure modes.

1. Survivorship bias

If you download historical Polymarket markets and filter for BTC 5-minute binaries, you are looking only at the markets that are still in the feed. Markets that were created but never reached sufficient liquidity were likely closed or removed. Your signal may look great on the markets that survived, but awful on the full population.

The fix: pull both closed=true and closed=false markets from the Gamma API, then filter by your entry criteria in code -- not by pre-selecting market types.

Also: do not filter out markets where you would have "not traded". If your signal says skip, record the outcome anyway. Your skip rate is part of your strategy and should be evaluated on actual market data.

// Bad: only backtest markets where your signal fired
const tradesBacktest = allMarkets.filter(m => mySignalFired(m))

// Better: test the signal decision on every market
const allDecisions = allMarkets.map(m => ({
  market: m,
  decision: mySignalFired(m) ? 'BUY' : 'SKIP',
  outcome: m.resolvedOutcome,
}))
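
One way to use the per-market decisions is to compare how the markets you bought resolved against the markets you skipped. The sketch below is a minimal illustration, not a full evaluation: `summarizeDecisions` is a hypothetical helper, and it assumes `outcome` is 1 for YES and 0 for NO.

```javascript
// Hypothetical helper: compares resolved outcomes on bought vs skipped markets.
// Assumes decision is 'BUY' or 'SKIP' and outcome is 1 (YES) or 0 (NO).
function summarizeDecisions(decisions) {
  const stats = { BUY: { n: 0, yes: 0 }, SKIP: { n: 0, yes: 0 } }
  for (const d of decisions) {
    const bucket = stats[d.decision]
    bucket.n += 1
    bucket.yes += d.outcome
  }
  // Returns null instead of NaN when a bucket is empty.
  const rate = (b) => (b.n === 0 ? null : b.yes / b.n)
  return {
    total: decisions.length,
    skipRate: decisions.length === 0 ? null : stats.SKIP.n / decisions.length,
    yesRateWhenBought: rate(stats.BUY),
    yesRateWhenSkipped: rate(stats.SKIP),
  }
}

// Toy data, invented for illustration only.
const sampleDecisions = [
  { decision: 'BUY', outcome: 1 },
  { decision: 'BUY', outcome: 0 },
  { decision: 'SKIP', outcome: 1 },
  { decision: 'SKIP', outcome: 0 },
]
const summary = summarizeDecisions(sampleDecisions)
console.log(summary)
```

If the YES rate on bought markets is no higher than on skipped ones, the signal is not selecting anything, however good the traded subset looks on its own.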

2. Look-ahead bias

Look-ahead bias means you used information in your signal that would not have been available at trade time. Common forms:

  • ATR calculated on the full session: when deciding whether to trade the 09:00 candle, you used ATR values that include the 15:00 and 17:00 candles.
  • Volume imbalance over the whole expiry period: the bot in live trading would only have volume data up to entry time.
  • Resolution used in the entry filter: checking "did it resolve YES" to decide whether the trade was valid.

The test: for each trade decision, can you construct that exact signal using only data available at t_entry? If you need data from t_entry + n_minutes, it is look-ahead.

// Bad: compute ATR over the entire session including post-entry data
const atr = computeATR(allCandles)

// Correct: compute ATR on candles available at entry time only
const candlesAtEntry = allCandles.filter(c => c.closeTime <= entryTime)
const atr = computeATR(candlesAtEntry.slice(-14))
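
The t_entry test can also be enforced mechanically inside the backtest loop. This is a sketch under assumptions: both function names are hypothetical, candles are assumed to be sorted by ascending `closeTime`, and `closeTime` and `entryTime` are assumed to be comparable numbers (for example epoch milliseconds).

```javascript
// Hypothetical helper: returns the last `lookback` candles that had already
// closed at entry time, and throws if there is not enough history.
// Assumes candles are sorted by ascending closeTime.
function candlesKnownAt(candles, entryTime, lookback) {
  const known = candles.filter(c => c.closeTime <= entryTime)
  if (known.length < lookback) {
    throw new Error(`need ${lookback} closed candles, only ${known.length} available`)
  }
  return known.slice(-lookback)
}

// Hypothetical guard: throws if any signal input is timestamped after entry.
function assertNoLookAhead(inputs, entryTime) {
  const leaked = inputs.filter(c => c.closeTime > entryTime)
  if (leaked.length > 0) {
    throw new Error(`${leaked.length} input(s) are timestamped after entry`)
  }
}
```

Calling the guard on every array that feeds the signal turns a silent look-ahead leak into a crash, which is much easier to notice than an inflated win rate.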

3. Fee modeling

The single most common error: assuming fees are 0 or using a constant percentage.

Polymarket fees are non-linear:

fee = shares * feeRate * price * (price * (1 - price))^exponent

With feeRate = 0.25 and exponent = 2, the taker fee at p=0.50 is ~1.56% of notional. At p=0.30, it drops to ~1.10%. A backtest that uses a flat 0.5% fee will understate fees in near-50 markets by about 3x, and overstate edge accordingly.

function takerFee(shares, price, feeRate = 0.25, exponent = 2) {
  return shares * feeRate * price * Math.pow(price * (1 - price), exponent)
}

function netPnl(shares, entryPrice, exitPrice) {
  const gross = shares * (exitPrice - entryPrice)
  const entryFee = takerFee(shares, entryPrice)
  const exitFee = takerFee(shares, exitPrice)
  return gross - entryFee - exitFee
}
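
To see how strongly the fee depends on price, it helps to express it as a fraction of notional. Dividing the fee formula above by shares * price leaves feeRate * (price * (1 - price))^exponent. The sketch below tabulates that quantity; `effectiveFeeRate` is a hypothetical helper, and the defaults are the same feeRate and exponent used in this article, which you should check against the current fee schedule.

```javascript
// Hypothetical helper: taker fee as a fraction of notional (shares * price),
// i.e. the fee formula divided by shares * price.
// feeRate = 0.25 and exponent = 2 are this article's defaults, not verified here.
function effectiveFeeRate(price, feeRate = 0.25, exponent = 2) {
  return feeRate * Math.pow(price * (1 - price), exponent)
}

// Print the effective rate at a few prices.
for (const p of [0.1, 0.3, 0.5, 0.7, 0.9]) {
  const pct = (effectiveFeeRate(p) * 100).toFixed(2)
  console.log(`p=${p.toFixed(2)}  fee=${pct}% of notional`)
}
```

The rate peaks at p=0.50 and is symmetric around it, so a single constant percentage cannot be right across the whole price range.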

Also model the bid-ask spread. In a backtest, assume you buy at ask and sell at bid, not at mid. On thin books the spread can be $0.03-0.05, which is 6-10% of the price on a 50-cent market, or 3-5% per side relative to mid.
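
A minimal sketch of the difference between mid-to-mid and ask-to-bid accounting follows. `spreadCost` is a hypothetical helper, and the `{ bid, ask }` quote shape is an assumption about how you store order book snapshots.

```javascript
// Hypothetical helper: per-share round-trip P&L when buying at the ask and
// selling at the bid, next to the optimistic mid-to-mid figure.
// Assumes each quote is an object of the form { bid, ask }.
function spreadCost(entryQuote, exitQuote) {
  const mid = (q) => (q.bid + q.ask) / 2
  const atMid = mid(exitQuote) - mid(entryQuote)
  const realistic = exitQuote.bid - entryQuote.ask
  return { atMid, realistic, costOfSpread: atMid - realistic }
}

// Toy quotes with a $0.04 spread, invented for illustration only.
const roundTrip = spreadCost({ bid: 0.48, ask: 0.52 }, { bid: 0.58, ask: 0.62 })
console.log(roundTrip)
```

In this toy case the spread alone takes roughly four cents per share out of a ten-cent mid-to-mid move, before any fees.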

Use range estimates, not point estimates

A backtest on 200 trades gives you a noisy estimate of edge. Report confidence intervals, not just the mean.

function bootstrapEdge(trades, nSamples = 10_000) {
  const n = trades.length
  const means = []
  for (let i = 0; i < nSamples; i++) {
    let sum = 0
    for (let j = 0; j < n; j++) {
      sum += trades[Math.floor(Math.random() * n)].pnl
    }
    means.push(sum / n)
  }
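  means.sort((a, b) => a - b)
  return {
    // Point estimate: the sample mean P&L per trade, not a bootstrap quantile
    mean: trades.reduce((sum, t) => sum + t.pnl, 0) / n,
    // 5th and 95th percentiles of the resampled means (a 90% interval)
    p5: means[Math.floor(nSamples * 0.05)],
    p95: means[Math.floor(nSamples * 0.95)],
  }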
}

If the p5 estimate is negative, your edge is not statistically established over that sample. Do not go live.

Out-of-sample testing

Split your historical data chronologically. Use the first 70% to tune parameters (threshold, ATR period, imbalance cutoff). Test the tuned parameters on the remaining 30% without touching them again.

If the out-of-sample Sharpe is less than 60% of the in-sample Sharpe, you have probably overfit. Loosen or drop the most finely tuned parameters, or collect more data.
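
The split and the 60% rule of thumb can be sketched as follows. All three function names are hypothetical, each trade is assumed to carry a numeric `entryTime` and a `pnl`, and the Sharpe ratio here is a plain per-trade ratio (mean over sample standard deviation), not an annualised figure.

```javascript
// Per-trade Sharpe ratio: mean P&L divided by sample standard deviation.
// Not annualised; only the in-sample vs out-of-sample comparison is used here.
function sharpe(pnls) {
  const n = pnls.length
  if (n < 2) return NaN
  const mean = pnls.reduce((s, x) => s + x, 0) / n
  const variance = pnls.reduce((s, x) => s + (x - mean) ** 2, 0) / (n - 1)
  return variance === 0 ? NaN : mean / Math.sqrt(variance)
}

// Chronological split: the earliest trainFraction of trades is used for
// tuning, the rest is held out. Assumes each trade has a numeric entryTime.
function splitChronological(trades, trainFraction = 0.7) {
  const sorted = [...trades].sort((a, b) => a.entryTime - b.entryTime)
  const cut = Math.floor(sorted.length * trainFraction)
  return { inSample: sorted.slice(0, cut), outOfSample: sorted.slice(cut) }
}

// Applies the 60% rule of thumb. Only meaningful when the in-sample Sharpe
// is positive; a NaN on either side makes the comparison return false.
function looksOverfit(inSample, outOfSample, minRatio = 0.6) {
  const sIn = sharpe(inSample.map(t => t.pnl))
  const sOut = sharpe(outOfSample.map(t => t.pnl))
  return sOut < minRatio * sIn
}
```

Sorting by entry time before cutting matters: a random split lets later market regimes leak into the tuning set, which defeats the purpose of holding data out.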

Slippage on entry

In live trading, your fill-or-kill (FOK) order fills at the best ask at the moment of submission. The ask moves between your signal firing and your order reaching the matching engine. Assume 1-2 ticks ($0.01-0.02) of additional slippage on each entry in your backtest.
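
In code, the slippage assumption is a small adjustment to the recorded ask. `entryWithSlippage` is a hypothetical helper; the $0.01 tick comes from the text above, and the cap at one tick below $1 is an assumption about the highest tradable price.

```javascript
// Hypothetical helper: worsens a backtested entry price by a number of ticks.
// tickSize = 0.01 matches the $0.01 tick quoted in the text. The cap at
// 1 - tickSize is an assumption: a binary share is taken not to trade at $1.
function entryWithSlippage(bestAsk, ticks = 2, tickSize = 0.01) {
  return Math.min(bestAsk + ticks * tickSize, 1 - tickSize)
}

// A recorded ask of 0.52 becomes an assumed fill of about 0.54.
console.log(entryWithSlippage(0.52).toFixed(2))
```

Running the backtest at 0, 1, and 2 ticks of slippage shows how much of the measured edge depends on perfect fills.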

Summary

  • Include all markets in the sample, not just the ones that fit your trade criteria.
  • Every signal input must be computable from data available at entry time, with nothing from after it.
  • Use the exact non-linear fee formula -- not a constant percentage.
  • Assume you pay the ask on entry and receive the bid on exit.
  • Report p5-p95 confidence intervals from bootstrapping, not just the mean.
  • Validate out-of-sample on the final 30% of data, and touch it only once.
