David Stainforth argued in a pair of co-authored articles in 2007 that climate models were so uncertain that no individual model should be taken seriously as a predictor of a single outcome (in that case, a temperature rise). This post explores the arguments presented and their applicability to hurricane science.

Stainforth, Allen, et al. present an argument against aggregation that is specifically about “complex climate models”: coupled atmosphere–ocean general circulation models (AOGCMs), whose purpose is “to mimic laboratories in other scientific disciplines; scientists use them to carry out experiments which are not possible in the real world” (2007, 2145). Roughly twenty such models exist worldwide, and the benchmark work in this field is conducted using a multi-model ensemble.

Stainforth, Allen, et al. argue that today’s AOGCM ensembles provide only a non-discountable envelope of outcomes—a set of possible outcomes. No individual model can provide a reliable central estimate, and therefore the ensemble should not be used to create one through aggregation. “Today’s ensembles give us a lower bound on the maximum range of uncertainty.” (Stainforth, Allen, et al. 2007, 2156) “‘Lower bound’ because further uncertainty exploration is likely to increase it; ‘Maximum range of uncertainty’ because methods to assess a model’s ability to inform us about real-world variables… could potentially constrain the ensemble and reduce the range.” (Stainforth, Downing, et al. 2007, 2166, my italics)

The argument for this conclusion has two parts. First, they present a simple but fundamental problem for aggregating complex climate models: they “cannot be meaningfully calibrated because they are simulating a never before experienced state of the system; the problem is one of extrapolation. It is therefore inappropriate to apply any of the currently available generic techniques which utilize observations to calibrate or weight models to produce forecast probabilities for the real world” (Stainforth, Allen, et al. 2007, 2145). In simple terms, because they are simulating something that has never happened before, we cannot know how accurate these models are.

Calibration refers to “tuning” the model—“that is, the manipulation of the independent variables to obtain a match between the observed and simulated distribution or distributions of a dependent variable or variables” (Oreskes, Shrader-Frechette, and Belitz 1994, 643). For climate models, this is typically done by hindcasting: partitioning the available dataset for the dependent variables into two parts, adjusting the model to reproduce the first part, and then testing it to ensure that it can predict the second part. In the case of simulating climate change, the best we can do is calibrate the model against past data which, under the hypothesis of climate change, does not demonstrate the effect we are trying to simulate.
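The train-then-test logic of hindcasting can be made concrete with a toy sketch. Everything here is invented for illustration (the synthetic "anomaly" series, the single trend parameter); real AOGCM tuning involves many parameters and physical fields, not a one-line fit.

```python
import numpy as np

# Toy hindcast-style calibration: tune a free parameter on the first
# half of an observed record, then check predictive skill on the
# held-out second half. Data are synthetic, purely for illustration.
rng = np.random.default_rng(0)
years = np.arange(1950, 2020)
observed = 0.01 * (years - 1950) + rng.normal(0, 0.1, years.size)

split = years.size // 2
train_t, test_t = years[:split], years[split:]
train_y, test_y = observed[:split], observed[split:]

# "Calibrate": fit a single trend parameter against the first half.
trend = np.polyfit(train_t, train_y, 1)

# "Hindcast test": does the tuned model predict the held-out half?
predicted = np.polyval(trend, test_t)
rmse = np.sqrt(np.mean((predicted - test_y) ** 2))
```

The catch Stainforth, Allen, et al. identify is that both halves of this record come from the same regime; a model that passes the hindcast test may still fail when extrapolated into a never-before-experienced state of the system.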

This is a fundamental constraint of climate change modelling, one that Stainforth, Allen, et al. argue is so limiting that we cannot use these models to make central predictions. It also means there is no dataset on which to test competing models in order to generate a skill score or performance-based weighting.

This raises an important question for hurricane modellers looking to predict landfall frequencies over the coming decade or two. To what extent has climate change already influenced cyclogenesis? More basically, given widespread evidence of accelerating climate change, how reliable is hindcasting as a method for calibrating hurricane models going forward?

Second, these models are highly uncertain. Stainforth, Allen, et al. identify four sources of uncertainty, each severe in the case of AOGCM ensembles: forcing uncertainty, initial condition uncertainty, parameter uncertainty, and model inadequacy. As an example, they present the wide range of predictions for the 8-year mean precipitation over the Mediterranean basin from December to February, under a doubling of atmospheric CO2. The range, -28% to +20%, is likely to widen as model uncertainty becomes better understood. Because of this empirical inadequacy, they

“consider this [weighting] to be futile. Relative to the real world, all models have effectively zero weight. Significantly non-zero weights may be obtained by inflating observational or model variability… [This is misleading as it] leads us to place more trust in a model whose mean response is substantially different (e.g. 5 standard errors) from observations than one whose mean response is very substantially different (e.g. 7 standard errors) from observations. A more constructive interpretation would be that neither is realistic for this variable and there is no meaning in giving the models weights based upon it.” (Stainforth, Allen, et al. 2007, 2155)

In the face of such uncertainty, any single answer is highly unreliable; not even the sign of the change is known with confidence. If one did have observations to compare against (given the problem of calibration outlined above, I assume that this talk of models with wide deviations from observation is metaphorical), the only way to generate meaningful weights would be to exaggerate the differences between these inadequate and highly uncertain models. While this would provide a relative comparison, engaging in such a process is likely to generate undue confidence by presenting a single result with a single range of uncertainty.[1]
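The authors' point about 5 versus 7 standard errors can be illustrated numerically. Assuming (as a simplification, not the authors' own calculation) a Gaussian error model, both models receive likelihood weights that are effectively zero in absolute terms, yet normalising them manufactures a strong apparent preference:

```python
import math

# Toy illustration: likelihood-style weights for two models whose
# mean responses sit 5 and 7 standard errors from observations,
# under a simple Gaussian error assumption.
def gaussian_weight(z):
    return math.exp(-0.5 * z * z)

w5 = gaussian_weight(5.0)   # ~3.7e-06: effectively zero
w7 = gaussian_weight(7.0)   # ~2.3e-11: effectively zero

# Normalising these negligible weights manufactures confidence:
# the 5-standard-error model appears overwhelmingly "trustworthy"
# even though neither is realistic for this variable.
relative = w5 / (w5 + w7)
```

This is exactly the trap in the passage above: the relative weight near 1 conceals the fact that both absolute weights are, relative to the real world, effectively zero.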

Instead, we should “acknowledge and highlight such unquantifiable uncertainties [and] present results on the basis that they have zero effect on our analysis.” What we can present with certainty is the range of results generated by our models, and the range of uncertainties accompanying them. This motivates the non-discountable envelope approach.
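The contrast between an envelope and a central estimate is simple to state in code. The ensemble values below are invented (loosely echoing the Mediterranean precipitation example above), purely to show what is reported under each approach:

```python
import numpy as np

# Hypothetical ensemble of model outputs: projected % change in
# precipitation, one value per model. Numbers are illustrative only.
ensemble = np.array([-28.0, -15.0, -5.0, 3.0, 12.0, 20.0])

# Non-discountable envelope: report the full range of outcomes.
envelope = (ensemble.min(), ensemble.max())

# A central estimate would collapse this to a single number,
# hiding the fact that even the sign of the change is contested.
central = ensemble.mean()
```

Under the envelope approach, the deliverable is the pair of bounds (and the accompanying uncertainties), not `central`; the envelope is a lower bound on the maximum range of uncertainty, since further exploration may widen it.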

How widely applicable is this argument? From the above, it seems that it should apply to highly uncertain models simulating a substantially new phase of a chaotic system. This is quite a specific set of conditions, which limits the application of the argument to more general cases. However, in the very important case of climate modelling, it provides a strong argument against the popular practice of central estimate aggregation.

In the case of hurricanes, my LSE blog post details the significant disagreement over which elements to include in the modelling of hurricane formation. In the face of such model uncertainty, coupled with initial condition uncertainty (here, the small dataset of historical severe hurricanes), we are arguably in a situation of comparable uncertainty.

In the case of climate models, Stainforth, Downing, et al. (2007) develop a system for using non-discountable envelopes for decision support. In essence, it involves taking seriously all outcomes within the envelope (or more precisely, the translation of the envelope of relevant model variables into a range of values for decision-relevant factors). Decisions are made with an eye to boundary values, not central estimates. They argue that this approach is required here given the nature of climate models, but it could well be applied prudentially in other cases where central estimation is not strictly disallowed.
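One simple way to make "decisions with an eye to boundary values" concrete is a minimax-style rule over the envelope bounds. This is my own hypothetical sketch, not the decision framework of Stainforth, Downing, et al.; the option names and cost functions are invented:

```python
# Hypothetical decision rule: evaluate each option at the envelope
# boundaries and pick the one with the smallest worst-case cost.
def robust_choice(options, envelope):
    """Minimax over the envelope bounds: each option is a cost
    function of the decision-relevant variable."""
    lo, hi = envelope
    return min(options, key=lambda cost: max(cost(lo), cost(hi)))

# Invented adaptation options, costed against precipitation change x (%).
def build_reservoir(x):
    return 100 + 0.5 * abs(x)   # high fixed cost, robust to extremes

def do_nothing(x):
    return 10 * abs(x)          # cheap centrally, exposed at boundaries

best = robust_choice([build_reservoir, do_nothing], (-28.0, 20.0))
```

Note that a central-estimate decision could easily reverse this: near the ensemble mean, `do_nothing` is cheaper, which is precisely why the boundary-focused rule reaches a different, more cautious conclusion.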


Stainforth, David A., M.R. Allen, E.R. Tredger, and L.A. Smith. 2007. “Confidence, Uncertainty and Decision-Support Relevance in Climate Predictions.” Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 365 (1857): 2145–61. doi:10.1098/rsta.2007.2074.

Stainforth, David A., T. E Downing, R. Washington, A. Lopez, and M. New. 2007. “Issues in the Interpretation of Climate Model Ensembles to Inform Decisions.” Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 365 (1857): 2163–77. doi:10.1098/rsta.2007.2073.

[1] Note that Stainforth, Allen, et al. accept that while all models are inadequate, this does not make them equal, and they later go on to note that one can safely exclude a model from consideration when it performs much worse than the best model available.