Friday, 27 June 2014

Self-review of problems with the HOME validation study for homogenization methods

In my last post, I argued that post-publication review is no substitute for pre-publication review, but it could be a nice addition.

This post is a post-publication self-review, a review of our paper on the validation of statistical homogenization methods, also called benchmarking when it is a community effort. Since writing this benchmarking article we have understood the problem better and have found some weaknesses. I have explained these problems on conferences, but for the people that did not hear them, please find them below after a short introduction. We have a new paper in open review that explains how we want to do better in the next benchmarking study.

Benchmarking homogenization methods

In our benchmarking paper we generated a dataset that mimicked real temperature or precipitation data. To this data we added non-climatic changes (inhomogeneities). We requested the climatologists to homogenize this data, to remove the inhomogeneities we had inserted. How good the homogenization algorithms are can be seen by comparing the homogenized data to the original homogeneous data.

This is straightforward science, but the realism of the dataset was the best to date and because this project was part of a large research program (the COST Action HOME) we had a large number of contributions. Mathematical understanding of the algorithms is also important, but homogenization algorithms are complicated methods and it is also possible to make errors in the implementation, thus such numerical validations are also valuable. Both approaches complement each other.

Group photo at a meeting of the COST Action HOME with most of the European homogenization community present. These are those people working in ivory towers, eating caviar from silver plates, drinking 1985 Romanee-Conti Grand Cru from crystal glasses and living in mansions. Enjoying the good live on the public teat, while conspiring against humanity.

The main conclusions were that homogenization improves the homogeneity of temperature data. Precipitation is more difficult and only the best algorithms were able to improve it. We found that modern methods improved the quality of temperature data about twice as much as traditional methods. It is thus important that people switch to one of these modern methods. My impression from the recent Homogenisation seminar and the upcoming European Meteorological Society (EMS) meeting is that this seems to be happening.

1. Missing homogenization methods

An impressive number of methods participated in HOME. Also many manual methods were applied, which are validated less because this is more work. All the state-of-the-art methods participated and most of the much used methods. However, we forgot to test a two- or multi-phase regression method, which is popular in North America.

Also not validated is HOMER, the algorithm that was designed afterwards using the best parts of the tested algorithms. We are working on this. Many people have started using HOMER. Its validation should thus be a high priority for the community.

2. Size breaks (random walk or noise)

Next to the benchmark data with the inserted inhomogeneities, we also asked people to homogenize some real datasets. This turned out to be very important because it allowed us to validate how realistic the benchmark data is. Information we need to make future studies more realistic. In this validation we found that the size of the benchmark in homogeneities was larger than those in the real data. Expressed as the standard deviation of the break size distribution, the benchmark breaks were typically 0.8°C and the real breaks were only 0.6°C.

This was already reported in the paper, but we now understand why. In the benchmark, the inhomogeneities were implemented by drawing a random number for every homogeneous period and perturbing the original data by this amount. In other words, we added noise to the homogeneous data. However, the homogenizers that requested to make breaks with a size of about 0.8°C were thinking of the difference from one homogeneous period to the next. The size of such breaks is influenced by two random numbers. Because variances are additive, this means that the jumps implemented as noise were the square root of two (about 1.4) times too large.

The validation showed that, except for the size, the idea of implementing the inhomogeneities as noise was a good approximation. The alternative would be to draw a random number and use that to perturb the data relative to the previously perturbed period. In that case you implement the inhomogeneities as a random walk. Nobody thought of reporting it, but it seems that most validation studies have implemented their inhomogeneities as random walks. This makes the influence of the inhomogeneities on the trend much larger. Because of the larger error, it is probably easier to achieve relative improvements, but because the initial errors were absolutely larger, the absolute errors after homogenization may well have been too large in previous studies.

You can see the difference between a noise perturbation and a random walk by comparing the sign (up or down) of the breaks from one break to the next. For example, in case of noise and a large upward jump, the next change is likely to make the perturbation smaller again. In case of a random walk, the size and sign of the previous break is irrelevant. The likeliness of any sign is one half.

In other words, in case of a random walk there are just as much up-down and down-up pairs as there are up-up and down-down pairs, every combination has a chance of one in four. In case of noise perturbations, up-down and down-up pairs (platform-like break pairs) are more likely than up-up and down-down pairs. The latter is what we found in the real datasets. Although there is a small deviation that suggests a small random walk contribution, but that may also be because the inhomogeneities cause a trend bias.

3. Signal to noise ratio varies regionally

The HOME benchmark reproduced a typical situation in Europe (the USA is similar). However, the station density in much of the world is lower. Inhomogeneities are detected and corrected by comparing a candidate station to neighbouring ones. When the station density is less, this difference signal is more noisy and this makes homogenization more difficult. Thus one would expect that the performance of homogenization methods is lower in other regions. Although, also the break frequency and break size may be different.

Thus to estimate how large the influence of the remaining inhomogeneities can be on the global mean temperature, we need to study the performance of homogenization algorithms in a wider range of situations. Also for the intercomparison of homogenization methods (the more limited aim of HOME) the signal (break size) to noise ratio is important. Domonkos (2013) showed that the ranking of various algorithms depends on the signal to noise ratio. Ralf Lindau and I have just submitted a manuscript that shows that for low signal to noise ratios, the multiple breakpoint method PRODIGE is not much better in detecting breaks than a method that would "detect" random breaks, while it works fine for higher signal to noise ratios. Other methods may also be affected, but possibly not in the same amount. More on that later.

4. Regional trends (absolute homogenization)

The initially simulated data did not have a trend, thus we explicitly added a trend to all stations to give the data a regional climate change signal. This trend could be both upward or downward, just to check whether homogenization methods might have problems with downward trends, which are not typical of daily operations. They do not.

Had we inserted a simple linear trend in the HOME benchmark data, the operators of the manual homogenization could have theoretically used this information to improve their performance. If the trend is not linear, there are apparently still inhomogeneities in the data. We wanted to keep the operators in the blind. Consequently, we inserted a rather complicated and variable nonlinear trend in the dataset.

As already noted in the paper, this may have handicapped the participating absolute homogenization method. Homogenization methods used in climate are normally relative ones. These methods compare a station to its neighbours, both have the same regional climate signal, which is thus removed and not important. Absolute methods do not use the information from the neighbours; these methods have to make assumptions about the variability of the real regional climate signal. Absolute methods have problems with gradual inhomogeneities and are less sensitive and are therefore not used much.

If absolute methods are participating in future studies, the trend should be modelled more realistically. When benchmarking only automatic homogenization methods (no operator) an easier trend should be no problem.

5. Length of the series

The station networks simulated in HOME were all one century long, part of the stations were shorter because we also simulated the build up of the network during the first 25 years. We recently found that criterion for the optimal number of break inhomogeneities used by one of the best homogenization methods (PRODIGE) does not have the right dependence on the number of data points (Lindau and Venema, 2013). For climate datasets that are about a century long, the criterion is quite good, but for much longer or shorter datasets there are deviations. This illustrates that the length of the datasets is also important and that it is important for benchmarking that the data availability is the same as in real datasets.

Another reason why it is important that the benchmark data availability to be the same as in the real dataset is that this makes the comparison of the inhomogeneities found in the real data and in the benchmark more straightforward. This comparison is important to make future validation studies more accurate.

6. Non-climatic trend bias

The inhomogeneities we inserted in HOME were on average zero. For the stations this still results in clear non-climatic trend errors because you only average over a small number of inhomogeneities. For the full networks the number of inhomogeneities is larger and the non-climatic trend error thus very small. It was consequently very hard for the homogenization methods to improve this small errors. It is expected that in real raw datasets there is a larger non-climatic error. Globally the non-climatic trend will be relatively small, but within one network, where the stations experienced similar (technological and organisational) changes, it can be appreciable. Thus we should model such a non-climatic trend bias explicitly in future.

International Surface Temperature Initiative

The last five problems will be solved in the International Surface Temperature Initiative (ISTI) benchmark . Whether a two-phase homogenization method will participate is beyond our control. We do expect less participants than in HOME because for such a huge global dataset, the homogenization methods will need to be able to run automatically and unsupervised.

The standard break sizes will be made smaller. We will make ten benchmarking "worlds" with different kinds of inserted inhomogeneities and will also vary the size and number of the inhomogeneities. Because the ISTI benchmarks will mirror the real data holdings of the ISTI, the station density and the length of the data will be the same. The regional climate signal will be derived from a global circulation models and absolute methods could thus participate. Finally, we will introduce a clear non-climate trend bias to several of the benchmark "worlds".

The paper on the ISTI benchmark is open for discussions at the journal Geoscientific Instrumentation, Methods and Data Systems. Please find the abstract below.

The International Surface Temperature Initiative (ISTI) is striving towards substantively improving our ability to robustly understand historical land surface air temperature change at all scales. A key recently completed first step has been collating all available records into a comprehensive open access, traceable and version-controlled databank. The crucial next step is to maximise the value of the collated data through a robust international framework of benchmarking and assessment for product intercomparison and uncertainty estimation. We focus on uncertainties arising from the presence of inhomogeneities in monthly surface temperature data and the varied methodological choices made by various groups in building homogeneous temperature products. The central facet of the benchmarking process is the creation of global scale synthetic analogs to the real-world database where both the "true" series and inhomogeneities are known (a luxury the real world data do not afford us). Hence algorithmic strengths and weaknesses can be meaningfully quantified and conditional inferences made about the real-world climate system. Here we discuss the necessary framework for developing an international homogenisation benchmarking system on the global scale for monthly mean temperatures. The value of this framework is critically dependent upon the number of groups taking part and so we strongly advocate involvement in the benchmarking exercise from as many data analyst groups as possible to make the best use of this substantial effort.

Related reading

Nick Stokes made a beautiful visualization of the raw temperature data in the ISTI database. Homogenized data where non-climatic trends have been removed is unfortunately not yet available, that will be released together with the results of the benchmark.

New article: Benchmarking homogenisation algorithms for monthly data. The post describing the HOME benchmarking article.

New article on the multiple breakpoint problem in homogenization. Most work in statistics is about data with just one break inhomogeneity (change point). In climate there are typically more breaks. Methods designed for multiple breakpoints are more accurate.

Part 1 of a series on Five statistically interesting problems in homogenization.


Domonkos, P., 2013: Efficiencies of Inhomogeneity-Detection Algorithms: Comparison of Different Detection Methods and Efficiency Measures. Journal of Climatology, Art. ID 390945, doi: 10.1155/2013/390945.

Lindau and Venema, 2013: On the multiple breakpoint problem and the number of significant breaks in homogenization of climate records. Idojaras, Quarterly Journal of the Hungarian Meteorological Service, 117, No. 1, pp. 1-34. See also my post: New article on the multiple breakpoint problem in homogenization.

Lindau and Venema, to be submitted, 2014: The joint influence of break and noise variance on the break detection capability in time series homogenization.

Willett, K., Williams, C., Jolliffe, I., Lund, R., Alexander, L., Brönniman, S., Vincent, L. A., Easterbrook, S., Venema, V., Berry, D., Warren, R., Lopardo, G., Auchmann, R., Aguilar, E., Menne, M., Gallagher, C., Hausfather, Z., Thorarinsdottir, T., and Thorne, P. W.: Concepts for benchmarking of homogenisation algorithm performance on the global scale, Geosci. Instrum. Method. Data Syst. Discuss., 4, 235-270, doi: 10.5194/gid-4-235-2014, 2014.


Peter Thorne said...

Health warning: This comment contains an analogy.

Victor, so much of science is incremental both in accretion of knowledge but also methods and capabilities. Objects in the rear view mirror are always distorted.

Anyway, the analogy ...

So, when I was a wee nipper of a lad growing up near the English coast every summer we would go to the coast and dangle our string with a rock and a piece of bacon over the side to try to catch some crabs (the eight legged creatures ...!). Of course, 9 times out of 10 the critter lets go before you get it up to the bucket. Darn!

So, after a year or two you start thinking can I do this better so get some wire and fashion it into a circle and then tie chicken wire into it and now there's a surface that most of the time but not all of the time the critters now stay on if you haul it up quickly enough.

Another year or two passes and now you think well, if I now create a second, similar, ring and weight it and have chicken wire in addition between the two then the critters who get on will never come off. And now you need your parent (or in one case their landrover and tow hitch!) to pull the net out.

At each point what was being done was to best knowledge and in each case it was superceded based upon prior experience and knowledge advances.

Beyond being an object microcosm of why we are over-fishing and the issue of global commons (other parents were very unimpressed!) I think this shows how in another decade we'll look back similarly on what we are doing now and nit pick. It would be extremely worrying if we were not to. Science and scientific knowledge should always advance and we can always do better.

Victor Venema said...

Peter, you should know that analogies lead to mayhem in the climate "debate", :-) but maybe we can use it among colleagues.

I fully agree with you. That is also why I am so enthusiastic about our ISTI benchmark mimicking the ISTI raw data holdings. That makes it possible to study discrepancies in the statistical properties of the detected inhomogeneities with unprecedented accuracy. This combined with the planned 3-year benchmarking cycle will allow us to make the validation more accurate every time.

The more we feel we were naive a decade ago, the more we have learned. HOME contributed a lot to that learning.

Daneel Olivaw said...

Quite interesting, even for a complete layman regarding homogeneization.
Thanks, Victor. You have a great skill in explaining complex problems :)

Victor Venema said...

Thanks for the compliment, Daneel. In that case I will remove the warning, "(The rest of this post describing the problems is unfortunately more technical as the average post on this blog.)", which may discourage some people from reading further. I guess my self-selected readers are relatively smart.