Showing posts with label benchmarking. Show all posts

Tuesday, 26 November 2013

Are break inhomogeneities a random walk or a noise?

Tomorrow is the next conference call of the benchmarking and assessment working group (BAWG) of the International Surface Temperature Initiative (ISTI; Thorne et al., 2011). The BAWG will create a dataset to benchmark (validate) homogenization algorithms. It will mimic the real mean temperature data of the ISTI, but will include known inhomogeneities, so that we can assess how well the homogenization algorithms remove them. We are almost finished discussing how the benchmark dataset should be developed, but still need to fix some details, such as the question: Are break inhomogeneities a random walk or a noise?
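To make the question concrete, here is a minimal Python sketch of the two candidate models. It is an illustration under stated assumptions, not the benchmark generator of the ISTI: the function names, the jump size `sigma = 0.8`, the series length and the number of breaks are all hypothetical choices. In the random walk case every break adds a random jump to the bias that is already there; in the noise case every break draws a new bias level that is independent of the previous one.

```python
import random
import statistics


def simulate_break_signal(length, break_positions, sigma, model, rng):
    """Return the non-climatic station bias per time step as a step function.

    model="random_walk": every break adds a random jump to the current level.
    model="noise": every break draws a new level, independent of the old one.
    """
    if model not in ("random_walk", "noise"):
        raise ValueError("model must be 'random_walk' or 'noise'")
    breaks = set(break_positions)
    level = 0.0  # assumption: the series starts without any bias
    signal = []
    for t in range(length):
        if t in breaks:
            draw = rng.gauss(0.0, sigma)
            level = level + draw if model == "random_walk" else draw
        signal.append(level)
    return signal


def final_level_spread(model, n_breaks, n_stations=2000, length=100,
                       sigma=0.8, seed=1):
    """Standard deviation over many stations of the bias at the series end."""
    rng = random.Random(seed)
    finals = []
    for _ in range(n_stations):
        # Hypothetical choice: break positions are uniformly distributed.
        positions = rng.sample(range(1, length), n_breaks)
        signal = simulate_break_signal(length, positions, sigma, model, rng)
        finals.append(signal[-1])
    return statistics.pstdev(finals)


if __name__ == "__main__":
    # Columns: number of breaks, random walk spread, noise spread.
    for n_breaks in (1, 4, 9):
        print(n_breaks,
              round(final_level_spread("random_walk", n_breaks), 2),
              round(final_level_spread("noise", n_breaks), 2))
```

In this toy setup the spread of the random walk bias should grow roughly with the square root of the number of breaks, whereas the spread of the noise bias stays close to `sigma`. That difference is why the choice matters for a benchmark.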

Previous studies

The benchmark dataset of the ISTI will be global and is also intended to be used to estimate uncertainties in the climate signal due to remaining inhomogeneities. These are the two main improvements over previous validation studies.

Williams, Menne, and Thorne (2012) validated the pairwise homogenization algorithm of NOAA on a dataset mimicking the US Historical Climate Network. The paper focusses on how well large-scale biases can be removed.

The COST Action HOME has performed a benchmarking of several small networks (5 to 19 stations) realistically mimicking European climate networks (Venema et al., 2012). Its main aim was to intercompare homogenization algorithms; the small networks allowed HOME to also test manual homogenization methods.

These two studies were blind, in other words the scientists homogenizing the data did not know where the inhomogeneities were. An interesting coincidence is that the people who generated the blind benchmarking data were outsiders at the time: Peter Thorne for NOAA and me for HOME. This probably explains why we both made an error, which we should not repeat in the ISTI.

Friday, 29 March 2013

Special issue on homogenisation of climate series

The open access Quarterly Journal of the Hungarian Meteorological Service "Időjárás" has just published a special issue on homogenization of climate records. This special issue contains eight research papers. It is an offspring of the COST Action HOME: Advances in homogenization methods of climate series: an integrated approach (COST-ES0601).

To be able to discuss eight papers, this post does not contain as much background information as usual and is aimed at people already knowledgeable about homogenization of climate networks.


Mónika Lakatos and Tamás Szentimrey: Editorial.
The editorial explains the background of this special issue: the importance of homogenisation and the COST Action HOME. Mónika and Tamás, thank you very much for your efforts to organise this special issue. I think every reader will agree that it has become a valuable journal issue.

Monthly data

Ralf Lindau and Victor Venema: On the multiple breakpoint problem and the number of significant breaks in homogenization of climate records.
My article with Ralf Lindau is already discussed in a previous post on the multiple breakpoint problem.
José A. Guijarro: Climatological series shift test comparison on running windows.
Longer time series typically contain more than one inhomogeneity, but statistical tests are mostly designed to detect one break. One way to resolve this conflict is by applying these tests on short moving windows. José compares six statistical detection methods (t-test, Standard Normal Homogeneity Test (SNHT), two-phase regression (TPR), Wilcoxon-Mann-Whitney test, Durbin-Watson test and SRMD: squared relative mean difference), which are applied on running windows with a length between 1 and 5 years (12 to 60 monthly values on either side of the potential break). The smart trick of the article is that all methods are calibrated to a false alarm rate of 1% for better comparison. In this way, he can show that the t-test, SNHT and SRMD are best for this problem and almost identical. To get good detection rates, the window needs to be at least 2*3 years. As this harbours the risk of having two breaks in one window, José has decided to change his homogenization method CLIMATOL to using the semi-hierarchical scheme of SNHT instead of using windows. The methods are tested on data with just one break; it would have been interesting to also simulate the more realistic case with multiple independent breaks.
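The running-window idea with a calibrated false alarm rate can be sketched in a few lines of Python. This is not José's code and not CLIMATOL; it is a minimal illustration for the t-test only, with several assumptions of mine: the function names are hypothetical, the noise is Gaussian white noise, the false alarm rate is counted per series (the fraction of homogeneous series in which any window exceeds the threshold), and the calibration is a small Monte Carlo simulation.

```python
import math
import random


def _mean_and_variance(values):
    """Sample mean and unbiased sample variance of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, variance


def window_t_statistic(series, t, half_width):
    """Equal-variance two-sample t statistic for a break just before index t."""
    mean_l, var_l = _mean_and_variance(series[t - half_width:t])
    mean_r, var_r = _mean_and_variance(series[t:t + half_width])
    pooled_variance = (var_l + var_r) / 2.0  # both windows have the same size
    return (mean_r - mean_l) / math.sqrt(2.0 * pooled_variance / half_width)


def candidate_positions(series, half_width):
    """Indices that have a full window on both sides."""
    return range(half_width, len(series) - half_width + 1)


def calibrate_threshold(length, half_width, false_alarm_rate=0.01,
                        n_sim=500, seed=0):
    """Threshold on max |t| exceeded by about the given share of clean series."""
    rng = random.Random(seed)
    maxima = []
    for _ in range(n_sim):
        noise = [rng.gauss(0.0, 1.0) for _ in range(length)]
        maxima.append(max(abs(window_t_statistic(noise, t, half_width))
                          for t in candidate_positions(noise, half_width)))
    maxima.sort()
    n_exceed = round(false_alarm_rate * n_sim)
    return maxima[-n_exceed - 1]


def detect_break(series, half_width, threshold):
    """Return the index with the largest |t| above the threshold, else None."""
    best_t, best_stat = None, threshold
    for t in candidate_positions(series, half_width):
        stat = abs(window_t_statistic(series, t, half_width))
        if stat > best_stat:
            best_t, best_stat = t, stat
    return best_t


if __name__ == "__main__":
    half_width = 36  # 3 years of monthly values on either side
    threshold = calibrate_threshold(length=240, half_width=half_width)
    rng = random.Random(7)
    # Hypothetical test series: one break of 1.5 standard deviations at month 120.
    series = [rng.gauss(0.0, 1.0) + (1.5 if i >= 120 else 0.0)
              for i in range(240)]
    print("threshold:", round(threshold, 2))
    print("detected break at:", detect_break(series, half_width, threshold))
```

With only 500 simulated series the 1% threshold is a rough estimate; a serious comparison would use many more simulations and calibrate every test statistic in the same way, which is what makes the methods comparable.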
Olivier Mestre, Peter Domonkos, Franck Picard, Ingeborg Auer, Stéphane Robin, Emilie Lebarbier, Reinhard Böhm, Enric Aguilar, Jose Guijarro, Gregor Vertachnik, Matija Klančar, Brigitte Dubuisson, and Petr Stepanek: HOMER: a homogenization software – methods and applications.
HOMER is a new homogenization method that was developed using the best methods tested on the HOME benchmark. Thus, theoretically, this should be the best method currently available. Still, sometimes interactions between parts of an algorithm can lead to unexpected results. It would be great if someone would test HOMER on the HOME benchmark dataset, so that we can compare its performance with the other algorithms.

Thursday, 4 October 2012

Beta version of a new global temperature database released

Today, a first version of the global temperature dataset of the International Surface Temperature Initiative (ISTI) with 39 thousand stations has been released. The aim of the initiative is to provide an open and transparent temperature dataset for climate research.

The database is designed as a climate "sceptic" wet dream: the entire processing of the data will be performed with automatic open software. This includes every processing step from conversion to standard units, to merging stations to longer series, to quality control, homogenisation, gridding and computation of regional and global means. There will thus be no opportunity for evil climate scientists to fudge the data and create an artificially strong temperature trend.

It is planned that in many cases, you can go back to the digital images of the books or cards on which the observer noted down the temperature measurements. This will not be possible for all data. Many records have been keyed directly in the past, without making digital images. Sometimes the original data is lost, for instance in the case of Austria, where the original daily observations were lost in the Second World War and only the monthly means are still available from annual reports.

The ISTI also has a group devoted to data rescue to encourage people to go into the archives, image and key in the observations and upload this information to the database.

Tuesday, 10 January 2012

New article: Benchmarking homogenisation algorithms for monthly data

The main paper of the COST Action HOME on homogenisation of climate data has been published today in Climate of the Past. This post briefly describes the problem of inhomogeneities in climate data and how such data problems are corrected by homogenisation. The main part explains the topic of the paper, a new blind validation study of homogenisation algorithms for monthly temperature and precipitation data. All the most used and best algorithms participated.


To study climatic variability the original observations are indispensable, but not directly usable. Next to real climate signals they may also contain non-climatic changes. Corrections to the data are needed to remove these non-climatic influences; this is called homogenisation. The best known non-climatic change is the urban heat island effect. The temperature in cities can be warmer than in the surrounding countryside, especially at night. Thus as cities grow, one may expect that temperatures measured in cities become higher. On the other hand, many stations have been relocated from cities to nearby, typically cooler, airports.

Other non-climatic changes can be caused by changes in measurement methods. Meteorological instruments are typically installed in a screen to protect them from direct sun and wetting. In the 19th century it was common to use a metal screen on a north-facing wall. However, the building may warm the screen, leading to higher temperature measurements. When this problem was realised the so-called Stevenson screen was introduced, typically installed in gardens, away from buildings. This is still the most common weather screen, with its characteristic double-louvre door and walls. Nowadays automatic weather stations, which reduce labor costs, are becoming more common; they protect the thermometer by a number of white plastic cones. This necessitated a change from manually read liquid-in-glass thermometers to automated electrical resistance thermometers, which reduces the recorded temperature values.

One way to study the influence of changes in measurement techniques is by making simultaneous measurements with historical and current instruments, procedures or screens. This picture shows three meteorological shelters next to each other in Murcia (Spain). The rightmost shelter is a replica of the Montsouris screen, in use in Spain and many European countries in the late 19th century and early 20th century. In the middle is a Stevenson screen equipped with automatic sensors. The leftmost is a Stevenson screen equipped with conventional meteorological instruments.
Picture: Project SCREEN, Center for Climate Change, Universitat Rovira i Virgili, Spain.

A further example of a change in the measurement method is that the precipitation amounts observed in the early instrumental period (roughly before 1900) are biased and are 10% lower than nowadays, because the measurements were often made on a roof. At the time, instruments were installed on rooftops to ensure that the instrument was never shielded from the rain, but it was found later that due to the turbulent flow of the wind on roofs, some rain droplets and especially snow flakes did not fall into the opening. Consequently, measurements are nowadays performed closer to the ground.

Sunday, 8 January 2012

What distinguishes a benchmark?

Benchmarking is a community effort

Science has many terms for studying the validity or performance of scientific methods: testing, validation, intercomparison, verification, evaluation, and benchmarking. Every term has a different, sometimes subtly different, meaning. Initially I had wanted to compare all these terms with each other, but that would have become a very long post, especially as the meaning for every term is different in business, engineering, computation and science. Therefore, this post will only propose a definition for benchmarking in science and what distinguishes it from other approaches, casually called other validation studies from now on.

In my view benchmarking has three distinguishing features.
1. The methods are tested blind.
2. The problem is realistic.
3. Benchmarking is a community effort.
The term benchmark has become fashionable lately. It is also used, however, for validation studies that do not display these three features. This is not wrong, as there is no generally accepted definition of benchmarking. In fact, an important article on benchmarking by Sim et al. (2003) defines "a benchmark as a test or set of tests used to compare the performance of alternative tools or techniques", which would include any validation study. They then limit the topic of their article, however, to interesting benchmarks, which are "created and used by a technical research community." However, if benchmarking is used for any type of validation study, there would not be any added value to the word. Thus I hope this post can be a starting point for a generally accepted and more restrictive definition.