A major approach – perhaps the major approach – to the treatment of models in an ensemble is via skill scores. These measure the accuracy of a prediction when compared to an observation.

The motivating question behind their development is this: how can one compare different probabilistic forecasts? It is surprising that there is no straightforward answer to this question. A plethora of different approaches—different scoring rules—exist, and the choice between them is largely pragmatic. They can be applied to forecasts or hindcasts.

Each approach faces problems. Forecast scores work well for high-frequency, short-timescale predictions such as weather forecasts. One can make and test predictions rapidly, and train or select for improved models. In a case like climate change, forecast testing is impractical: the timescales are long, and the purpose of the forecast is to serve as a basis for action intended to prevent or mitigate the predicted outcome.

Hindcasting faces different problems. In low-frequency cases there is typically too little data to support accurate hindcasting. Consider the hurricane case. The dataset for large storms impacting the North Atlantic coast is small. Taking the widest possible reference class for a category of hurricane, we would have no more than ~40 data points (for the whole coast) to split into one portion on which to train the model, and another to match with predictions. If one worried that some data points aren’t relevant and so restricted the reference class—say, to events during periods close to the test period, passing through nearby gates—it is highly likely that the entire set would hold only a single data point, if any. In practice, hindcasting is therefore conducted against a statistical dataset rather than the real one.

In other cases (in particular climate change-related cases) we expect that the future will not be like the past: this is the core hypothesis of climate change. In such circumstances it is unclear that training on historical data, even supposing enough exists, is useful given the purpose of the model. (This will also apply to an extent in the hurricane case, as climate change is thought to influence cyclogenesis.)

Leaving aside these issues of reliable prediction generation for the moment, let us turn to their verification. If we have a prediction that a C5 storm is a 1/200 event in the five-year period 1990–1995—with different predictions for the 5-year intervals on either side due to different input and boundary conditions—is Hurricane Andrew (in 1992) support for this or not? How much support?

The Australian Bureau of Meteorology maintains a website on probabilistic forecast verification techniques.[1] It lists seven categories of forecast (binary, multi-category, continuous, probabilistic, spatial, ensemble, and rare), and covers more than 50 scoring rules, visualisation techniques and analytical approaches to measuring forecasting success. Some are appropriate only for a single category, others have broader application (Australian Bureau of Meteorology 2017). Which rule to use is itself a matter of expert debate, and so, on pain of introducing a second-order elicitation, we cannot easily devise a method for a decision-maker to select a single rule, or devise an algorithm for rule selection.

In practice, the decision will likely be pragmatic. For binary forecasts, one can choose between rules which maximise the proportion of correct predictions, minimise errors, minimise false positives, minimise false negatives, or trade off between those goals. The decision-maker’s assessment of their interests could reasonably influence rule selection. If there are high risks to missing an event, such as a catastrophe, one might prefer a rule that minimises “misses.” If the costs associated with event preparation are burdensomely high, she might prefer to avoid false positives. If there is sufficient historical data, this choice can be treated decision-theoretically: using performance on past data, one could assess each expert’s predictive success in each category (hits, misses, false positives, false negatives), assign each a utility value, and attempt to optimise by choice of scoring rule.
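The decision-theoretic procedure just described can be sketched in a few lines. All the counts and utility values below are hypothetical assumptions introduced purely for illustration; nothing in the text fixes them.

```python
def expected_utility(counts, utilities):
    """Utility-weighted average outcome over a binary-forecast contingency table.

    Both arguments are dicts keyed by the four outcome categories:
    'hit', 'miss', 'false_positive', 'correct_negative'.
    """
    total = sum(counts.values())
    return sum(counts[k] * utilities[k] for k in counts) / total

# Hypothetical past performance of two experts over 100 forecast occasions.
expert_a = {"hit": 8, "miss": 2, "false_positive": 15, "correct_negative": 75}
expert_b = {"hit": 6, "miss": 4, "false_positive": 5, "correct_negative": 85}

# A decision-maker who fears missing a catastrophe weights misses heavily...
miss_averse = {"hit": 10, "miss": -100, "false_positive": -5, "correct_negative": 0}
# ...while one burdened by preparation costs penalises false alarms instead.
cost_averse = {"hit": 10, "miss": -20, "false_positive": -30, "correct_negative": 0}

for label, u in [("miss-averse", miss_averse), ("cost-averse", cost_averse)]:
    a = expected_utility(expert_a, u)
    b = expected_utility(expert_b, u)
    print(label, "prefers expert", "A" if a > b else "B")
```

With these (invented) numbers, the miss-averse utilities favour expert A and the cost-averse utilities favour expert B, which is exactly the sensitivity to the utility assessment that the next paragraph worries about.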

There is an uncomfortable sense of “goal seeking” in this behaviour, however. If we are simply choosing a scoring rule to fit our needs, we risk stacking the deck in favour of our assessment of the utilities. Instead of the probabilities in our decision matrix being an independent assessment of the likely states of the world, they will have been constructed in part by the utility assessment. If this is deemed permissible, we might well ask why aggregate at all. Why not simply choose the expert opinion which “optimises”? The obvious dangers of perverse incentives (a cost-averse decision-maker choosing the expert with the lowest probability of events in order to avoid investment, rather than choosing the one with the lowest false positive rate to avoid wasted investment) carry through from choice of expert to choice of scoring rule.

In the hurricane case, things are in some ways easier and in others harder. Easier because for rare events most scoring rules do not work well. (For example, Brier scores are not sensitive enough: assuming a single hurricane in a hundred-year period, and constant 1/100 and 1/200 predictions by the two experts, their Brier scores will differ only in the 5th decimal place.) The Australian Bureau lists just two methods suitable for very low frequency forecasts. One is the “deterministic limit,” which measures the increasing accuracy of forecasts with shorter lead times to the event. (This is, of course, of great interest in predicting rare weather events where the focus is on preparation and mitigation.) The score measures the lead time at which, over a suitably large and representative forecast sample, the number of hits equals the total number of misses and false alarms (Hewson, n.d.). This rule is still demanding in terms of frequency data, however, meaning that for many contexts it will not deliver useful results (Hewson, n.d., S.4.). The second is a set of “extremal dependence indices,” developed by Ferro and Stephenson (2011). However, there are four indices, with differing behaviour, much as described for binary scoring rules above. Choice between them will, again, be pragmatic.
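The insensitivity claim about Brier scores is easy to check directly. The sketch below uses only the numbers given in the text (one event in a hundred years; constant 1/100 and 1/200 forecasts); the function name is my own.

```python
def brier_score(forecasts, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

outcomes = [1] + [0] * 99        # one hurricane in a hundred-year period
expert_a = [1 / 100] * 100       # constant 1/100 annual forecast
expert_b = [1 / 200] * 100       # constant 1/200 annual forecast

bs_a = brier_score(expert_a, outcomes)
bs_b = brier_score(expert_b, outcomes)
print(f"expert A: {bs_a:.6f}")   # 0.009900
print(f"expert B: {bs_b:.6f}")   # 0.009925
```

The two scores agree to four decimal places and diverge only at the fifth (a difference of 0.000025), despite one expert assigning twice the probability of the other.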

Scoring rules offer the appearance of objectivity, but it is a mirage. The choice of scoring rule is a value-laden decision with as much impact on the final outcome as the selection of priors in a Bayesian procedure. One advantage of the scoring rule approach is that it may facilitate a frank discussion of the values employed in the selection of the rule. Once the rule is in place, it is “objective” in the sense of being transparent and stable (a procedure-based notion of objectivity), and may therefore lend the result a confidence that opaque elicitation methods lack.


Australian Bureau of Meteorology. 2017. “Forecast Verification.” http://www.cawcr.gov.au/projects/verification/.

Brier, Glenn W. 1950. “Verification of Forecasts Expressed in Terms of Probability.” Monthly Weather Review 78 (1): 1–3. https://docs.lib.noaa.gov/rescue/mwr/078/mwr-078-01-0001.pdf.

Ferro, C.A.T., and D.B. Stephenson. 2011. “Extremal Dependence Indices: Improved Verification Measures for Deterministic Forecasts of Rare Binary Events.” Weather and Forecasting 26: 699–713.

Hewson, Tim. n.d. “New Approaches to Verifying Forecasts of Hazardous Weather.” Met Office UK. http://www.cawcr.gov.au/projects/verification/Hewson/DeterministicLimit.html.

[1] Meteorology is the source of much of the study on probabilistic forecasting and its verification. The Brier score was developed in that field (Brier 1950), as were its later refinements (Murphy and Epstein 1967).