Tuesday, 26 November 2013

Are break inhomogeneities a random walk or a noise?

Tomorrow is the next conference call of the benchmarking and assessment working group (BAWG) of the International Surface Temperature Initiative (ISTI; Thorne et al., 2011). The BAWG will create a dataset to benchmark (validate) homogenization algorithms. It will mimic the real mean temperature data of the ISTI, but will include known inhomogeneities, so that we can assess how well the homogenization algorithms remove them. We are almost finished discussing how the benchmark dataset should be developed, but still need to fix some details. Such as the question: are break inhomogeneities a random walk or a noise?

Previous studies

The benchmark dataset of the ISTI will be global and is also intended to be used to estimate uncertainties in the climate signal due to remaining inhomogeneities. These are the two main improvements over previous validation studies.

Williams, Menne, and Thorne (2012) validated the pairwise homogenization algorithm of NOAA on a dataset mimicking the US Historical Climate Network. The paper focusses on how well large-scale biases can be removed.

The COST Action HOME has performed a benchmarking of several small networks (5 to 19 stations) realistically mimicking European climate networks (Venema et al., 2012). Its main aim was to intercompare homogenization algorithms; the small networks also allowed HOME to test manual homogenization methods.

These two studies were blind, in other words the scientists homogenizing the data did not know where the inhomogeneities were. An interesting coincidence is that the people who generated the blind benchmarking data were outsiders at the time: Peter Thorne for NOAA and me for HOME. This probably explains why we both made an error, which we should not repeat in the ISTI.

Inhomogeneities for benchmarking

One of the nice things about benchmarking, about generating artificial inhomogeneities, is that you have to specify exactly what the inhomogeneities look like. By doing so you notice how much we do not know yet. For the ISTI benchmark we noticed how little we know about the statistical properties and causes of inhomogeneities outside of Europe and the USA. The frequency and magnitude of the breaks are expected to be similar, but we know little about the seasonal cycle and biases. This makes it important to include a broad range of possibilities in a number of artificial worlds. Afterwards we can test which benchmark was nearest to the real dataset by comparing the detected inhomogeneities for the real data and the various benchmarks.

One of those details is the question how to implement the break inhomogeneities. Some breaks are known from the metadata (station history), for example relocations, changes of instrumentation, or changes of screens. If you make a distribution of the jump sizes, the difference in mean temperature before and after such a break, you find a normal distribution with a standard deviation of about 0.7°C for the USA. For Europe the experts thought that 0.8°C would be a realistic value.

But size is not all. There are two main ways to implement such inhomogeneities. You can perturb the homogeneous data between two inhomogeneities (by a random number drawn from a normal distribution); that is what I would call noise. The noise is a deviation from a baseline (the homogeneous validation data, which should be reconstructed at the end).

You can also start at the beginning, perturb the data for the first homogeneous subperiod (HSP), the period between the first and the second break (by a random number), and then perturb the second HSP relative to the first HSP. This is what I would call a random walk.

This makes a difference: for the random walk the deviation from the homogeneous data grows with every break, at least on average, while for the noise the deviation stays the same on average. As a consequence, the random walk produces larger trend errors than the noise, and the breaks are probably also easier to find.
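The two implementations can be sketched in a few lines of Python. This is an idealized simulation, not the actual benchmark code: it assumes Gaussian perturbations with the HOME standard deviation of 0.8°C, one perturbation per homogeneous subperiod (HSP), and an ensemble of stations to make the averages visible.

```python
import numpy as np

rng = np.random.default_rng(42)

n_stations = 10_000  # ensemble of artificial station series
n_breaks = 10        # number of homogeneous subperiods (HSPs)
sigma = 0.8          # std of one perturbation in deg C (HOME value)

# One perturbation per HSP and station.
perturbations = rng.normal(0.0, sigma, (n_stations, n_breaks))

# "Noise": every HSP deviates independently from the homogeneous baseline.
noise = perturbations

# "Random walk": every HSP is perturbed relative to the previous HSP,
# so the deviations from the baseline accumulate.
walk = np.cumsum(perturbations, axis=1)

# Deviation from the homogeneous data in the last HSP:
print(f"noise deviation: {noise[:, -1].std():.2f} C")  # stays near sigma
print(f"walk deviation:  {walk[:, -1].std():.2f} C")   # grows like sqrt(n_breaks)*sigma
```

After ten breaks the walk has drifted roughly sqrt(10) times further from the baseline than the noise, which is the source of the larger trend errors.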

Real inhomogeneities

The next question is: what do real inhomogeneities look like? Like noise, or like a random walk? We studied this in the paper on the HOME benchmark by comparing the statistical properties of the detected breaks in the benchmark and in some real datasets. To understand the quote from the HOME paper below you have to know that the benchmark contained two datasets with two different methods to generate the homogeneous data: surrogate and synthetic data.
If the perturbations applied at a break were independent, the perturbation time series would be a random walk. In the benchmark the perturbations are modeled as random noise, as a deviation from a baseline signal, which means that after a large break up (down) the probability of a break down (up) is increased. Defining a platform as a pair of breaks with opposite sign, this means that modeling the breaks as a random noise produces more than 50 % platform pairs [while for a random walk the percentage is 50%, VV]. The percentage of platforms in the real temperature data section is 59 (n=742), in the surrogate data 64 (n=1360), and in the synthetic data 62 (n=1267). The artificial temperature data thus contains more platforms; the real data is more like a random walk. This percentage of platforms and the difference between real and artificial data become larger if only pairs of breaks with a minimum magnitude are considered.
In other words, for a random walk you expect 50% platform break pairs; the real number is clearly higher and close to the value for a noise. However, there are somewhat fewer platform break pairs than you would expect for a noise. Thus reality is in between, but quite close to noise.
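The expected platform percentages can be checked with a small simulation. This is an idealization (equal-variance Gaussian perturbations, every break detected, no minimum magnitude), so the noise case lands at the theoretical two thirds rather than at the 62–64% found for the benchmark data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000  # number of simulated breaks

# "Noise": perturbation levels are independent deviations from the baseline,
# so the jump at each break is the difference of two adjacent levels.
levels = rng.normal(0.0, 1.0, n)
jumps_noise = np.diff(levels)

# "Random walk": the jumps themselves are independent draws.
jumps_walk = rng.normal(0.0, 1.0, n - 1)

def platform_fraction(jumps):
    """Fraction of consecutive break pairs with opposite sign (platforms)."""
    return np.mean(jumps[:-1] * jumps[1:] < 0)

print(f"noise: {platform_fraction(jumps_noise):.0%} platforms")  # about 67 %
print(f"walk:  {platform_fraction(jumps_walk):.0%} platforms")   # about 50 %
```

For the noise case consecutive jumps share one level and are anti-correlated (correlation -0.5), which pushes the platform fraction above 50%; for the walk the jumps are independent and the fraction stays at 50%.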

In HOME we modelled the perturbations as noise, which luckily turned out to be right. Lucky, because when generating the dataset I had not considered the alternative; maybe my colleagues would have warned me. However, I had stupidly not thought of the simple fact that the size of the breaks is larger than the size of the noise, by the square root of two, because the size of one jump is determined by two values, the noise before and the noise after the break. This is probably the main reason why we found in the same validation that the breaks we had inserted were too big: we used 0.8°C for the standard deviation of the noise, but 0.6°C would have been closer to the real datasets. In the NOAA benchmarking study the perturbations were modelled as a random walk, if I understand it correctly (and with a wide range of break sizes from 1.5 to 0.2°C).
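The square-root-of-two effect is easy to verify numerically. A quick sketch, again with idealized Gaussian noise levels:

```python
import numpy as np

rng = np.random.default_rng(2)

# Each jump at a break combines the noise level before and after the break,
# so the jump std is sqrt(2) times larger than the noise std.
for sigma_noise in (0.8, 0.6):
    levels = rng.normal(0.0, sigma_noise, 1_000_000)
    jumps = np.diff(levels)
    print(f"noise std {sigma_noise} C -> jump std {jumps.std():.2f} C")
```

A noise standard deviation of 0.8°C thus produces jumps of about 1.1°C, too big compared to the roughly 0.8°C seen in real data, while 0.6°C noise gives jumps of about 0.85°C.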

The ISTI benchmark

How should we insert break inhomogeneities in the ISTI benchmark? Mainly as noise, but also partially as a random walk, I would argue.

Here it should be remembered that break inhomogeneities are not purely random, but can also have a bias. For example, the transition to Stevenson screens resulted in a bias of less than 0.2°C according to Parker (1994), mainly based on North-West European data. The older data had a warm bias due to radiation errors.

It makes sense to expect that once such errors have been noticed, they are not reintroduced again. In other words, such bias inhomogeneities behave like random walks: they continue until the end of the series and future inhomogeneities use them as a basis. If we model the biases due to inhomogeneities as random walks and their random components as a noise, we may well be close to the mixture of noise and random walk found in real data.

One last complication is that the bias is not constant: the network mean bias of a certain transition will have a different effect on every station. For example, in case of such radiation errors, it would depend on the insolation and thus the cloudiness (for the maximum temperature), on the humidity and cloudiness at night (for the minimum temperature), and on the wind (because of ventilation).

Thus if the network mean bias is bn, the station bias (bs) could be drawn from a normal distribution with mean bn and standard deviation rn. To this one would add a random component (rs) drawn from a normal distribution with mean zero and a standard deviation of around 0.6°C. The bias component would be implemented as a perturbation from the break to the end of the series, and the random component as a perturbation from one break to the next.
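As a sketch of how one break of such a transition could be drawn per station: the values for bn and rn below are purely illustrative (a Stevenson-screen-like cooling of 0.2°C with an assumed 0.1°C spread across stations), only the 0.6°C random component comes from the discussion above.

```python
import numpy as np

rng = np.random.default_rng(3)

n_stations = 1_000
b_n = -0.2     # network mean bias of the transition, deg C (illustrative)
r_n = 0.1      # spread of that bias across stations, deg C (assumed)
sigma_s = 0.6  # std of the random break component, deg C

# Bias component: per-station draw around the network mean bias;
# applied from the break until the end of the series (random-walk-like).
b_s = rng.normal(b_n, r_n, n_stations)

# Random component: per-station draw around zero;
# applied only from this break until the next one (noise-like).
r_s = rng.normal(0.0, sigma_s, n_stations)

jump = b_s + r_s  # total perturbation inserted at this break
print(f"mean jump {jump.mean():.2f} C, spread {jump.std():.2f} C")
```

The two components would then be added to the homogeneous series over their respective periods; only the biased component accumulates across breaks.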


Parker, D.E. Effects of changing exposure of thermometers at land stations. Int. J. Climatol., 14, pp. 1–31, doi: 10.1002/joc.3370140102, 1994.

Thorne, P.W., K.M. Willett, R.J. Allan, S. Bojinski, J.R. Christy, N. Fox, et al. Guiding the Creation of A Comprehensive Surface Temperature Resource for Twenty-First-Century Climate Science. Bull. Amer. Meteor. Soc., 92, ES40–ES47, doi: 10.1175/2011BAMS3124.1, 2011.

Venema, V., O. Mestre, E. Aguilar, I. Auer, J.A. Guijarro, P. Domonkos, G. Vertacnik, T. Szentimrey, P. Stepanek, P. Zahradnicek, J. Viarre, G. Müller-Westermeier, M. Lakatos, C.N. Williams, M.J. Menne, R. Lindau, D. Rasol, E. Rustemeier, K. Kolokythas, T. Marinova, L. Andresen, F. Acquaotta, S. Fratianni, S. Cheval, M. Klancar, M. Brunetti, Ch. Gruber, M. Prohom Duran, T. Likso, P. Esteban, Th. Brandsma. Benchmarking homogenization algorithms for monthly data. Climate of the Past, 8, pp. 89-115, doi: 10.5194/cp-8-89-2012, 2012.

Williams, C.N., Jr., M.J. Menne, and P. Thorne. Benchmarking the performance of pairwise homogenization of surface temperatures in the United States. J. Geophys. Res., 117, no. D05116, doi: 10.1029/2011JD016761, 2012.


Kate Willett said...

This has got me thinking. We're working really hard to reproduce realistic synthetic stations and station cross-correlations. However, this effort will only really be retained over the most recent period which we're considering the 'reference period'. So, in a way should we be working backwards through time in applying inhomogeneities? Once we apply seasonally varying changes we've moved away from the cross-correlations/autocorrelations we've worked so hard to recreate - which is inevitable. I think I'm now even more confused about how best to actually go about perturbing the stations to create inhomogeneities realistically.

Victor Venema said...

Hi Kate, you are right. The inhomogeneities reduce the cross-correlations between the stations (on average). And this change can be quite substantial. That is why it is best to generate the homogeneous data for the benchmark based on the cross-correlations of a homogenized dataset. That will not only improve the decadal variability, but also the cross-correlations, which are very important for relative homogenization.

(After adding the synthetic inhomogeneities to the synthetic homogeneous station data, the cross-correlations should also be similar to the cross-correlations of the real raw data. However, that would depend on the magnitude of the inserted inhomogeneities, which we will vary over a broad range. The best way seems to me to use homogenized data as an example to generate the homogeneous data.)

PeterThorne said...

I would concur that we are drawing from multiple distributions. Effectively this can potentially be mapped, as you say, two ways. There are breaks that arise for random reasons, which means that you 'reset the clock', and there are breaks which are due to systematic efforts, deliberate or otherwise, that potentially impart a systematic bias into the network as a whole. On an individual station or local basis both matter. For regional or global long-term trends it's the systematic bias artefacts that act as a random walk that matter. These are the artefacts that, if not corrected properly, impart a residual bias.

In the same way that you are saying use the cross-correlation of homogenized data you could also use the timeseries of adjustment estimates from a densely sampled set of regions to infer to what extent in some subset of the globe biases act as random draw and to what extent as random walk.

On the cross-correlation issue in creating the homogeneous worlds ... perhaps the most recent reanalyses e.g. ERA-Interim / JRA55 or regional nested reanalyses NARRA / EURO4M could help? These won't contain the random and sampling error artefacts and won't get the orographic effects perfectly so will be smoother than the true inter-station correlation field though.

Victor Venema said...

Thanks for your thoughts Peter. Yes, the problem is more complicated. Maybe we would also need a category for semi-permanent breaks. For example for relocations that introduce a new baseline until the next relocation (or the end of the series).

The disadvantage of comparing the cross-correlations in the raw data and the corrupted benchmark data is that you not only have to get the homogeneous data right, but also have to insert realistic inhomogeneities. Furthermore, errors in the one can compensate errors in the other.

Preferably we should do both comparisons. I would see comparing homogenized data to clear benchmark data as most important.

I personally tend not to believe models much when it comes to variability. The mean model state is what everyone looks at, the variability is unfortunately not high up on most agendas. At least on scales close to the model resolution, I would expect deviations in the spatial structure. Also the analysis methods aim to estimate the mean state as well as possible and in doing so probably introduce a tendency towards too smooth fields.

The new kind of regional high-resolution reanalysis data might be good for our purposes because of its resolution. Or 1-day forecast fields from a global weather prediction model; one year should be sufficient to estimate the spatial correlations.