Thursday 4 October 2012

Beta version of a new global temperature database released

Today, a first version of the global temperature dataset of the International Surface Temperature Initiative (ISTI) with 39 thousand stations has been released. The aim of the initiative is to provide an open and transparent temperature dataset for climate research.

The database is designed as a climate "sceptic" wet dream: the entire processing of the data will be performed with automatic open software. This includes every processing step from conversion to standard units, to merging stations to longer series, to quality control, homogenisation, gridding and computation of regional and global means. There will thus be no opportunity for evil climate scientists to fudge the data and create an artificially strong temperature trend.

It is planned that in many cases, you can go back to the digital images of the books or cards on which the observer noted down the temperature measurements. This will not be possible for all data. Many records have been keyed directly in the past, without making digital images. Sometimes the original data is lost, for instance in the case of Austria, where the original daily observations were lost in the Second World War and only the monthly means are still available from annual reports.

The ISTI also has a group devoted to data rescue to encourage people to go into the archives, image and key in the observations and upload this information to the database.

To ensure that the automatic software is working well, there is a benchmarking and assessment working group (web | blog), of which I am a member. This working group will generate ten "copies" of the global database with flawless artificial station data and will introduce outliers and inhomogeneities into them. This data will also be processed with the automatic algorithms to be able to assess the quality of the various algorithms and see how well they removed the outliers and inhomogeneities.
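The benchmarking idea can be illustrated with a minimal sketch. It is not the working group's actual procedure: the form of the artificial series (small trend plus Gaussian noise), the function names and all the numbers are assumptions chosen for illustration. The point is only that the flawless series is known, so any processed version can be scored against it.

```python
import math
import random


def make_clean_series(n_months, seed=0):
    """Flawless artificial monthly anomalies: small trend plus noise (assumed form)."""
    rng = random.Random(seed)
    return [0.001 * t + rng.gauss(0.0, 0.5) for t in range(n_months)]


def corrupt(series, break_at, break_size, outlier_at, outlier_size):
    """Return a copy with one step inhomogeneity and one outlier inserted."""
    corrupted = list(series)
    # A break shifts all values from break_at onwards, like a station relocation.
    for t in range(break_at, len(corrupted)):
        corrupted[t] += break_size
    # An outlier affects a single value only.
    corrupted[outlier_at] += outlier_size
    return corrupted


def rmse(truth, estimate):
    """Root mean square error between the known truth and a processed series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(truth, estimate)) / len(truth))


clean = make_clean_series(240)  # 20 years of monthly data
dirty = corrupt(clean, break_at=120, break_size=0.8, outlier_at=30, outlier_size=6.0)

# Error of the unprocessed data. A useful algorithm should bring this down
# when its output is compared with the same clean series.
print(rmse(clean, dirty))
```

In a real benchmark the `dirty` series would be handed to the quality control and homogenisation algorithms, and their output would take the place of `dirty` in the last line.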

Such a benchmarking will be performed in every three-year cycle, based on the then current state of the database. The final dataset with raw data for the first cycle of the ISTI is expected in January 2013. Soon after that, the benchmark dataset with the 10 artificial datasets will also be available. In this first cycle we will focus on the automatic homogenisation of the monthly data. At the end of the cycle the performance of the homogenisation algorithms will be assessed on the benchmark data and homogenised real data using the best algorithms will be provided.

Automatic versus manual processing

I like automatic software very much. Not only because this software is objective, but also because automatic methods can be validated much better and thus can be improved much faster. And just because software works automatically does not mean that a climatologist should not look carefully at the input and the results. Automatic algorithms can also use machine-readable metadata (metadata is data about data, for example, the dates on which a station was moved, the instrumentation was changed or a heavy storm occurred). The outliers and breaks that are found in a first run can be validated with documentary metadata to improve the quality of the machine-readable metadata for a second run with such a software package.

Still, one should not be blind to the disadvantages of automatic methods. For example, in homogenisation a candidate station (that needs to be homogenised) is compared to its neighbours. The neighbours are used as reference to remove the complicated regional climate signal. A local climatologist will know which neighbouring stations are most similar and are best used as reference. An automatic method will typically look at the distance between stations, their height difference and how strongly two stations are cross-correlated. A climatologist may override such automatic selection criteria once in a while, for instance because one of the stations is near a glacier or a lake.
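A minimal sketch of such an automatic selection, with a manual override, could look as follows. The thresholds, the station dictionaries and the function names are hypothetical and do not come from any specific homogenisation package. Real methods differ in detail, for instance many correlate the month-to-month difference series rather than the raw values.

```python
import math


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points (haversine formula)."""
    radius = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = p2 - p1
    dlon = math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))


def correlation(x, y):
    """Pearson correlation; assumes equal length and non-constant series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def select_references(candidate, neighbours, max_km=300.0, max_dz=500.0,
                      min_corr=0.8, exclude=()):
    """Keep neighbours that are near, at similar height and well correlated.

    `exclude` stands for the climatologist's manual override, e.g. a station
    next to a glacier. All thresholds are illustrative assumptions.
    """
    selected = []
    for nb in neighbours:
        if nb["name"] in exclude:
            continue
        if distance_km(candidate["lat"], candidate["lon"], nb["lat"], nb["lon"]) > max_km:
            continue
        if abs(candidate["elev"] - nb["elev"]) > max_dz:
            continue
        if correlation(candidate["values"], nb["values"]) < min_corr:
            continue
        selected.append(nb)
    return selected


def difference_series(candidate, references):
    """Candidate minus the mean of the references: the shared regional signal cancels."""
    diffs = []
    for t, value in enumerate(candidate["values"]):
        ref_mean = sum(r["values"][t] for r in references) / len(references)
        diffs.append(value - ref_mean)
    return diffs


# Invented example stations; the values are not real observations.
cand = {"name": "cand", "lat": 47.0, "lon": 11.0, "elev": 600.0,
        "values": [1.0, 2.0, 1.5, 3.0, 2.5, 4.0]}
neighbours = [
    {"name": "near", "lat": 47.2, "lon": 11.1, "elev": 650.0,
     "values": [1.2, 2.2, 1.7, 3.2, 2.7, 4.2]},
    {"name": "far", "lat": 52.0, "lon": 11.0, "elev": 600.0,
     "values": [1.1, 2.1, 1.6, 3.1, 2.6, 4.1]},
    {"name": "summit", "lat": 47.1, "lon": 11.0, "elev": 2800.0,
     "values": [0.9, 1.9, 1.4, 2.9, 2.4, 3.9]},
    {"name": "noisy", "lat": 47.0, "lon": 11.3, "elev": 620.0,
     "values": [3.0, 1.0, 4.0, 1.0, 3.5, 0.5]},
]

refs = select_references(cand, neighbours)
print([r["name"] for r in refs])
print(difference_series(cand, refs))
```

A break in the candidate station would show up as a jump in the difference series, because the regional climate signal that both stations share has been subtracted.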

A local climatologist also has better access to metadata. A value that is seen as an outlier by an automatic algorithm may well be real and may be the topic of a story in a local newspaper. In this way, a global database with data that is quality controlled and homogenised by local climatologists, such as the HadCRU dataset, may be more complicated and less transparent, but may well be more accurate.

Thus I fully agree with the closing sentences of the recent article by Vincent et al. (2012) on the new Canadian temperature dataset:

It is important to encourage homogenization work carried out by scientists who are likely to have access to local data and metadata and who are familiar with their own local geography and climate variations. Homogenized data sets prepared at regional and national level can be helpful to complete and validate large/global homogenized data sets prepared by the scientific community.

I would add that it is also good to try to convert as much of the knowledge of the local climatologist as possible into automatic software and that we should pursue both roads.

More posts on homogenisation

Homogenisation of monthly and annual data from surface stations
A short description of the causes of inhomogeneities in climate data (non-climatic variability) and how to remove it using the relative homogenisation approach.
New article: Benchmarking homogenisation algorithms for monthly data
Raw climate records contain changes due to non-climatic factors, such as relocations of stations or changes in instrumentation. This post introduces an article that tested how well such non-climatic factors can be removed.
HUME: Homogenisation, Uncertainty Measures and Extreme weather
Proposal for future research in homogenisation of climate network data.
A short introduction to the time of observation bias and its correction
The time of observation bias is an important cause of inhomogeneities in temperature data.


Vincent, L. A., X. L. Wang, E. J. Milewska, H. Wan, F. Yang, and V. Swail (2012), A second generation of homogenized Canadian monthly surface air temperature for climate trend analysis, J. Geophys. Res., 117, D18110, doi:10.1029/2012JD017859.


  1. Will there be a means to subset from the database based upon lat/long and/or irregular polygon shapes? The reason I ask is that it is of climatological interest to be able to subset to individual small regions and then combine the corresponding stations. Will there be a method (similar to BEST) used for combining stations in gridcells?

    1. I cannot promise anything. The work on the dataset is voluntary, but I would expect that such facilities are planned. If you are interested and are willing to publish your code openly, you could also write programs to perform these tasks yourself.

      We are only at the beginning. Currently, only a first dataset and the software to merge the data from various sources into one consistent dataset are available.

