Just as we can test forecasters, we can also compare two different forecasters who are making predictions over time. The setup follows that of game-theoretic hypothesis testing.

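Let $S$ be a proper scoring rule. Suppose forecaster 1 makes forecast $p_t$ at time $t$, and forecaster 2 makes forecast $q_t$ at time $t$. Define

$$\hat{\Delta}_t = \frac{1}{t} \sum_{i=1}^{t} \big( S(p_i, y_i) - S(q_i, y_i) \big),$$

which is the empirical difference between the forecasters' performance, where $y_i$ is the outcome of the $i$-th forecast. We want to quantify the difference between $\hat{\Delta}_t$ and $\Delta_t$, the true expected difference between the forecasters:

$$\Delta_t = \frac{1}{t} \sum_{i=1}^{t} \mathbb{E}\big[ S(p_i, y_i) - S(q_i, y_i) \mid \mathcal{F}_{i-1} \big],$$

where $(\mathcal{F}_t)_{t \ge 0}$ is the filtration capturing what's happened so far, so $\mathcal{F}_{i-1}$ contains everything known before the $i$-th outcome is revealed. Depending on the behavior of $S$ (e.g., whether it is bounded or light-tailed), we can develop confidence sequences for $\Delta_t$ or perform sequential hypothesis testing to determine whether $\Delta_t \le 0$ (say). The same conditions on $S$ let us form a sub-$\psi$ process for the cumulative deviation $t(\hat{\Delta}_t - \Delta_t)$, which lets us apply much of the machinery of safe, anytime-valid inference (SAVI) to this problem.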
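To make the bounded case concrete, here is a minimal Python sketch of one possible confidence sequence for the average expected score difference. It rests on several assumptions that are not part of the setup above: outcomes are binary, the scoring rule is the Brier score rewritten as a reward `1 - (p - y)^2` (so higher is better and each score difference lies in $[-1, 1]$), and the boundary is a simple fixed-$\lambda$ Hoeffding-style bound obtained from Ville's inequality. The names `brier_reward` and `confidence_sequence` are hypothetical, and this is not the tighter construction one would find in the literature.

```python
import math
import random


def brier_reward(forecast: float, outcome: int) -> float:
    """Brier score as a reward in [0, 1] (assumed scoring rule; higher is better)."""
    return 1.0 - (forecast - outcome) ** 2


def confidence_sequence(forecasts_1, forecasts_2, outcomes, alpha=0.05, lam=0.1):
    """Return a list of (t, delta_hat, lower, upper), one entry per time step.

    Score differences lie in [-1, 1], so by Hoeffding's lemma the centered
    increments are conditionally sub-Gaussian with psi(lam) = lam**2 / 2.
    Ville's inequality applied to each tail with level alpha / 2 then gives,
    simultaneously over all t with probability at least 1 - alpha,
        |delta_hat_t - Delta_t| <= log(2 / alpha) / (lam * t) + lam / 2.
    """
    running_sum = 0.0
    rows = []
    for t, (p, q, y) in enumerate(zip(forecasts_1, forecasts_2, outcomes), start=1):
        running_sum += brier_reward(p, y) - brier_reward(q, y)
        delta_hat = running_sum / t
        radius = math.log(2.0 / alpha) / (lam * t) + lam / 2.0
        rows.append((t, delta_hat, delta_hat - radius, delta_hat + radius))
    return rows


if __name__ == "__main__":
    # Hypothetical simulation: outcomes are Bernoulli(0.7); forecaster 1 always
    # predicts 0.7 and forecaster 2 always predicts 0.2.
    rng = random.Random(0)
    n = 2000
    outcomes = [1 if rng.random() < 0.7 else 0 for _ in range(n)]
    rows = confidence_sequence([0.7] * n, [0.2] * n, outcomes)
    for t, delta_hat, lower, upper in rows[199::600]:
        print(f"t={t:5d}  delta_hat={delta_hat:+.3f}  CS=[{lower:+.3f}, {upper:+.3f}]")
```

With a fixed `lam`, the radius shrinks toward `lam / 2` rather than zero, so this boundary is tightest around one chosen sample size. If the lower endpoint ever rises above zero, the null hypothesis that the average expected difference is at most zero can be rejected at level `alpha`, no matter when we choose to stop.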

Refs