Showing posts with label ISTI. Show all posts

Saturday, 16 January 2016

The transition to automatic weather stations. We’d better study it now.

This is a POST post.

The Parallel Observations Science Team (POST) is looking across the world for climate records which simultaneously measure temperature, precipitation and other climate variables with a conventional sensor (for example, a thermometer) and modern automatic equipment. You may wonder why we take the painstaking effort of locating and studying these records. The answer is easy: the transition from manual to automated records has an effect on climate series and the analysis we do over them.

In the last decades we have seen a major transition of the climate monitoring networks from conventional manual observations to automatic weather stations. It is recommended to compare the old and new instruments with side-by-side measurements, which we call parallel measurements, before the substitution takes effect. Climatologists have also set up many longer experimental parallel measurements. They tell us that in most cases the two sensors do not measure the same temperature or collect the same amount of precipitation. A different temperature is not only due to the change of the sensor itself; automatic weather stations also often use a different, much smaller, screen to protect the sensor from the sun and the weather. Often the introduction of automatic weather stations is accompanied by a change in location and siting quality.

From studies of single temperature networks that made such a transition we know that it can cause large jumps; the observed temperatures at a station can go up or down by as much as 1°C. Thus potentially this transition can bias temperature trends considerably. We are now trying to build a global dataset with parallel measurements to be able to quantify how much the transition to automatic weather stations influences the global mean temperature estimates used to study global warming.


This study is led by Enric Aguilar and the preliminary results below were presented at the Data Management Workshop in Saint Gallen, Switzerland last November. We are still in the process of building up our dataset. Up to now we have data from 10 countries: Argentina (9 pairs), Australia (13), Brazil (4), Israel (5), Kyrgyzstan (1), Peru (31), Slovenia (3), Spain (46), Sweden (8), USA (6); see map below.

Global map in which we only display the 10 countries for which we have data. The left map is for the maximum temperature (TX) and the right for the minimum temperature (TN). Blue dots mean that the automatic weather station (AWS) measures cooler temperatures than the conventional observation; red dots mean the AWS is warmer. The size indicates how large the difference is; open circles mark differences that are not statistically significant.

The impact of the automation can be better assessed in the box plots below.

The biases of the individual pairs are shown as dots and summarized per country with box plots. For countries with only a few pairs the box plots should be taken with a grain of salt. Negative values mean that the automatic weather stations are cooler. We have data for Argentina (AR), Australia (AU), Brazil (BR), Spain (ES), Israel (IL), Kyrgyzstan (KG), Peru (PE), Sweden (SE), Slovenia (SI) and the USA (US). Panels show the maximum temperature (TX), minimum temperature (TN), mean temperature (TM) and diurnal temperature range (DTR, TX-TN).

On average there are no real biases in this dataset. However, if you remove Peru (PE) the differences in the mean temperature are either small or negative. That one country is so important shows that our dataset is currently too small.
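As a rough illustration of how such a summary could be computed, here is a minimal Python sketch. It assumes the bias of a pair is the mean difference "AWS minus conventional" (so negative means the AWS is cooler, as in the box plots). The numbers and the function names are made up for illustration; they are not POST results or POST code.

```python
from statistics import mean, median

# Hypothetical per-pair mean differences (AWS minus conventional, in °C).
# These values are invented for illustration only.
pair_bias = {
    "PE": [0.4, 0.6, 0.5],
    "ES": [-0.2, -0.1, -0.3],
    "IL": [0.0, -0.05],
}

def country_medians(bias_by_country):
    """Median bias per country (the centre line of a box plot)."""
    return {country: median(values) for country, values in bias_by_country.items()}

def overall_mean(bias_by_country, exclude=()):
    """Mean bias over all pairs, optionally leaving out some countries."""
    values = [
        bias
        for country, biases in bias_by_country.items()
        if country not in exclude
        for bias in biases
    ]
    return mean(values)

print(country_medians(pair_bias))
print(overall_mean(pair_bias))                   # all countries
print(overall_mean(pair_bias, exclude=("PE",)))  # leave Peru out
```

Leaving out one country at a time like this is a simple way to see how much a single country drives the overall average.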

To interpret the results we need to look at the main causes for the differences. Important reasons are that Stevenson screens can heat up in the sun on calm days, while automatic sensors are sometimes ventilated. The automatic sensors are, furthermore, typically smaller and thus less affected by direct radiation hitting them than thermometers. On the other hand, in case of conventional observation, the maintenance of the Stevenson screens—cleaning and painting—and detection of other problems may be easier because they have to be visited daily. There are concerns that plastic screens get more grey and heat more in the sun. Stevenson screens have more thermal inertia, they smooth fast temperature fluctuations, and will thus show lower highs and higher lows.

Also the location often changes with the installation of automatic weather stations. America was one of the early adopters. The US National Weather Service installed analogue semi-automatic equipment (MMTS) that did not allow for long cables between the sensor and the display inside a building. Furthermore, the technicians only had one day per station and as a consequence many of the MMTS systems were badly sited. Nowadays technology has advanced a lot and made it easier to find good sites for weather stations. This is maybe even easier now than it used to be for manual observations because modern communication is digital and if necessary uses radio making distance much less a concern. The instruments can be powered by batteries, solar or wind, which frees them from the electricity grid. Some instruments store years of data and need just batteries.

In the analysis we thus need to consider whether the automatic sensors are placed in Stevenson screens and whether the automatic weather station is at the same location. Where the screen and the location did not change (Israel and Slovenia), the temperature jumps are small. Whether the automatic weather station reduces radiation errors by mechanical ventilation is likely also important. Because of these different categories, the number of datasets needed to get a good global estimate becomes larger. Up to now, these factors seem to be more important than the climate.


For most of these countries we also have parallel measurements for precipitation. The figure below was made by Petr Stepanek, who leads this part of the study.

Boxplots for the differences in monthly precipitation sums due to automation. Positive values mean that the manual observations record more precipitation. Countries are: Argentina (AG), Brazil (BR), the Czech Republic (CZ), Israel (IS), Kyrgyzstan (KG), Peru (PE), Sweden (SN), Spain (SP) and the USA (US). The width of the boxplots corresponds to the size of the given dataset.

For most countries the automatic weather stations record less precipitation. This is mainly due to smaller amounts of snow during the winter. Observers often put a snow cross in the gauge in winter to make it harder for snow to blow out of it again. Observers simply melt the snow gathered in a pot to measure precipitation, while early automatic weather stations did not work well with snow and sticky snow piling up in the gauge may not be noticed. These problems can be solved by heating the gauge, but unfortunately the heater can also increase the amount of precipitation that evaporates before it could be registered. Such problems are known and more modern rain gauges use different designs and likely have a smaller bias again.

Database with parallel data

The above results are very preliminary, but we wanted to show the promise of a global dataset with parallel data to study biases in the climate record due to changes in the observing practices. To proceed we need more datasets and better information on how the measurements were performed to make this study more solid.

In future we also want to look more at how the variability around the mean is changing. We expect that changes in monitoring practices have a strong influence on the tails of the distribution and thus on estimates of changes in extreme weather. Parallel data offer a unique opportunity to study this otherwise hard problem.

Most of the current data comes from Europe and South America. If you know of any parallel datasets especially from Africa or Asia, please let us know. Up to now, the main difficulty for this study is to find the persons who know where the data is. Fortunately, data policies do not seem to be a problem. Parallel data is mostly seen as experimental data. In some cases we “only” got a few years of data from a longer dataset, which would otherwise be seen as operational data.

We would like to publish the dataset after publishing our papers about it. Again this does not seem to lead to larger problems. Sometimes people prefer to first publish an article themselves, which causes some delays. Sometimes we cannot publish the daily data itself, but “only” monthly averages and extreme value indices. This makes the results less transparent, but these summary values contain most of the information.

Knowledge of the observing practices is very important in the analysis. Thus everyone who contributes data is invited to help in the analysis of the data and co-author our first paper(s). Our studies are focused on global results, but we will also provide everyone with results for their own dataset to gain a better insight into their data.

Most climate scientists would agree that it is important to understand the impact of automation on our records. So does the World Meteorological Organization. In case it helps you to convince your boss: the Parallel Observations Science Team is part of the International Surface Temperature Initiative (ISTI). It is endorsed by the Task Team on Homogenization (TT-HOM) of the World Meteorological Organization (WMO).

We expect that this endorsement and our efforts to raise awareness about our goals and their importance will help us to locate and study parallel observations from other parts of the world, especially Africa and Asia. We also expect to be able to get more data from Europe; the regional association for Europe of the WMO has designated the transition to automatic weather stations as one of its priorities and is helping us to get access to more data. We want to have datasets for all over the world to be able to assess whether the station settings (sensors, screens, data quality, etc.) have an impact, but also to understand if different climates produce different biases.

If you would like to collaborate or have information, please contact me.

Related reading

The ISTI has made a series of brochures on POST in English, Spanish, French and German. If anyone is able to make further translations, that would be highly appreciated.

Parallel Observations Science Team of the International Surface Temperature Initiative.

Irrigation and paint as reasons for a cooling bias

Temperature trend biases due to urbanization and siting quality changes

Changes in screen design leading to temperature trend biases

Temperature bias from the village heat island

Sunday, 4 January 2015

How climatology treats sceptics

2014 was an exciting year for me, a lot happened. It could have gone wrong: my science project, and thus my employment, ended. This would have been the ideal moment to easily get rid of me, no questions asked. But my follow-up project proposal (Daily HUME) to develop a new homogenization method for global temperature datasets was approved by the German Science Foundation.

It was an interesting year. The work I presented at conferences was very skeptical of our abilities to remove non-climatic changes from climate records (homogenization). Mitigation skeptics sometimes claim that my job, the job of all climate scientists, is to defend the orthodoxy. They might think that my skeptical work would at least hurt my career, if not make me an outright outcast, like they are.

Knowing science, I did not fear this. What counts is the quality of your arguments, not whether a trend goes up or down, whether a confidence interval becomes larger or smaller. As long as your arguments are strong, the more skeptical, the better, the more interesting the work is. What would hurt my reputation would be if my arguments were just as flimsy as those of the mitigation skeptics.

With a bunch of colleagues we are working on a review paper on non-climatic changes in daily data. Daily data is used to study climatic changes in extreme weather: heat waves, cold spells, heavy rain, etc. Much too simplified, we found that the limited evidence suggests that non-climatic changes affect the extremes more than the mean and that removing them is very hard, while most large daily data collections are not homogenized, or only for changes in the mean. In other words, we found that the scientific literature supports the hunch of the climate skeptics of the IPCC:
"This [inhomogeneous data] affects, in particular, the understanding of extremes, because changes in extremes are often more sensitive to inhomogeneous climate monitoring practices than changes in the mean." Trenberth et al. (2007)
Not a nice message, but a large number of wonderful colleagues is happy to work with me on this review paper. Thank you for your trust.

Last May at the homogenization seminar in Budapest, I presented this work, while my colleague presented our joint work on homogenization when the size of the breaks is small. Or, formulated more technically: homogenization when the variance of the break signal is small relative to the variance of the difference time series (the difference between two nearby stations). The positions of the detected breaks are in this case not much better than random breaks. This problem was found by Ralf, a great analytical thinker and skeptic. Thank you for working with me.
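To make the "variance of the break signal relative to the variance of the difference time series" a bit more concrete, here is a toy Python sketch. It assumes the break positions are known (as they would be in synthetic data), takes the break signal to be the step function of segment means, and treats the remainder of the difference series as noise. The function names and the numbers are hypothetical; this is not the method from the joint work mentioned above.

```python
from statistics import mean, pvariance

def difference_series(candidate, neighbour):
    """Difference between a candidate station and a nearby station."""
    return [c - n for c, n in zip(candidate, neighbour)]

def break_signal(diff, break_positions):
    """Step function: the mean of the difference series within each segment."""
    edges = [0, *sorted(break_positions), len(diff)]
    signal = []
    for start, end in zip(edges[:-1], edges[1:]):
        if end > start:  # skip empty segments
            signal.extend([mean(diff[start:end])] * (end - start))
    return signal

def signal_to_noise(diff, break_positions):
    """Variance of the break signal divided by the variance of the noise.

    Raises ZeroDivisionError if the series contains no noise at all.
    """
    signal = break_signal(diff, break_positions)
    noise = [d - s for d, s in zip(diff, signal)]
    return pvariance(signal) / pvariance(noise)

# Toy example: a break of about 1 °C after the second value.
diff = difference_series(
    candidate=[9.9, 10.1, 10.9, 11.1],
    neighbour=[10.0, 10.0, 10.0, 10.0],
)
snr = signal_to_noise(diff, break_positions=[2])
print(snr)
```

In this toy case the ratio is large and the break is easy to find; the problem described above is the opposite situation, where this ratio is small.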

Because my project ended and I did not know whether I would get the next one, and especially not whether I would get it in time, I asked two groups in Budapest whether they could support me during this bridge period. Both promised they would try. The next week the University of Bern offered me a job. Thank you Stefan and Renate, I had a wonderful time in Bern and learned a lot.

Thus my skeptical job is on track again and more good things happened. For the next good news I first have to explain some acronyms. The World Meteorological Organisation (WMO) coordinates the work of the (national) meteorological services around the world, for example by defining standards for measurements and data transfer. The WMO has a Commission for Climatology (CCl). For the coming 4-year term this commission has a new Task Team on Homogenization (TT-HOM). It cannot be much more than 2 years ago that I asked a colleague what the abbreviation he had used, "CCl", stood for. Last spring they asked whether I wanted to be a member of the TT-HOM. This autumn they made me chair. Thank you CCl and especially Thomas and Manola. I hope to be worthy of your trust.

Furthermore, I was asked to be co-convener of the session on Climate monitoring; data rescue, management, quality and homogenization at the Annual Meeting of the European Meteorological Society. That is quite an honor for a homogenization skeptic that is just an upstart.

More good things happened. While in Bern, Renate and I started working on a database with parallel measurements. In a parallel measurement an old measurement set-up stands next to a new one to directly compare the difference between them and to thus determine the non-climatic change this difference in set-ups produced. Because I am skeptical of our abilities to correct non-climatic changes in daily data, I hope that in this way we can study how important they are. A real skeptic does not just gloat when finding a problem, but tries to solve it as well. The good news is that the group of people working on this database is now an expert team of the International Surface Temperature Initiative (ISTI). Thank you ISTI steering committee and especially Peter.

In all this time, I had only one negative experience. After presenting our review article on daily data a colleague asked me whether I was a climate "skeptic". That was clearly intended as a threat, but knowing all those other colleagues behind me I could just laugh it off. In retrospect, my choice of words was also somewhat unfortunate. As an example, I had said that climatic changes in 20-year return levels (an extreme that happens on average every 20 years) probably cannot be studied using homogenized data given that the typical period between two non-climatic changes is 20 years. Unfortunately, this colleague afterwards presented a poster on climatic changes in the 20-year return period. Had I known that, I would have chosen another example. No hard feelings.

That is how climatology treats skeptics. I cannot complain. On the contrary, a lot of people supported me.

If you can complain, if you feel like a persecuted heretic (and not only claim that as part of your political fight), you may want to reconsider whether your arguments are really that strong. You are always welcome back.

A large part of the homogenization community at a project meeting in Bucharest 2010. They make a homogenization skeptic feel at home. Love you guys.

Eric Steig strongly criticized the IPCC; his experience (archive):

I was highly critical of IPCC AR4 Chapter 6, so much so that the [mitigation skeptical] Heartland Institute repeatedly quotes me as evidence that the IPCC is flawed. Indeed, I have been unable to find any other review as critical as mine. I know - because they told me - that my reviews annoyed many of my colleagues, including some of my RC colleagues, but I have felt no pressure or backlash whatsoever from it. Indeed, one of the Chapter 6 lead authors said “Eric, your criticism was really harsh, but helpful - thank you!”

So who are these brilliant young scientists whose careers have been destroyed by the supposed tyranny of the IPCC? Examples?

James Annan later writes:
Well, I don't think I got quite such a rapturous response as Eric did, with my attempts to improve the AR4 drafts, but I certainly didn't get trampled and discredited either [which Judith Curry evidently wrongly claims the IPCC does] - merely made to feel mildly unwelcome, which I find tends to happen when I criticise people outside the IPCC too. But they did change the report in various ways. While I'm not an unalloyed fan of the IPCC process, my experience is not what she [Judith Curry] describes it as. So make that two anecdotes.

Maybe people could start considering whether there is a difference between qualified critique and uninformed nonsense. Valuing quality is part of the scientific culture.

Related posts

On consensus and dissent in science - consensus signals credibility

Why doesn't Big Oil fund alternative climate research?

Are debatable scientific questions debatable?

Falsifiable and falsification in science

Peer review helps fringe ideas gain credibility


Trenberth, K.E., et al., 2007: Observations: Surface and Atmospheric Climate Change. In: Climate Change 2007: The Physical Science Basis. Cambridge University Press, Cambridge, United Kingdom and New York, NY, USA.

Wednesday, 8 October 2014

A framework for benchmarking of homogenisation algorithm performance on the global scale - Paper now published

By Kate Willett reposted from the Surface Temperatures blog of the International Surface Temperature Initiative (ISTI).

The ISTI benchmarking working group have just had their first benchmarking paper accepted at Geoscientific Instrumentation, Methods and Data Systems:

Willett, K., Williams, C., Jolliffe, I. T., Lund, R., Alexander, L. V., Brönnimann, S., Vincent, L. A., Easterbrook, S., Venema, V. K. C., Berry, D., Warren, R. E., Lopardo, G., Auchmann, R., Aguilar, E., Menne, M. J., Gallagher, C., Hausfather, Z., Thorarinsdottir, T., and Thorne, P. W.: A framework for benchmarking of homogenisation algorithm performance on the global scale, Geosci. Instrum. Method. Data Syst., 3, 187-200, doi:10.5194/gi-3-187-2014, 2014.

Benchmarking, in this context, is the assessment of homogenisation algorithm performance against a set of realistic synthetic worlds of station data where the locations and size/shape of inhomogeneities are known a priori. Crucially, these inhomogeneities are not known to those performing the homogenisation, only those performing the assessment. Assessment of both the ability of algorithms to find changepoints and accurately return the synthetic data to its clean form (prior to addition of inhomogeneity) has three main purposes:

1) quantification of uncertainty remaining in the data due to inhomogeneity
2) inter-comparison of climate data products in terms of fitness for a specified purpose
3) providing a tool for further improvement in homogenisation algorithms

Here we describe what we believe would be a good approach to a comprehensive homogenisation algorithm benchmarking system. This includes an overarching cycle of: benchmark development; release of formal benchmarks; assessment of homogenised benchmarks and an overview of where we can improve for next time around (Figure 1).

Figure 1. Overview of the ISTI comprehensive benchmarking system for assessing performance of homogenisation algorithms. (Fig. 3 of Willett et al., 2014)

There are four components to creating this benchmarking system.

Creation of realistic clean synthetic station data
Firstly, we must be able to synthetically recreate the 30000+ ISTI stations such that they have the same variability, auto-correlation and interstation cross-correlations as the real data but are free from systematic error. In other words, they must contain a realistic seasonal cycle and features of natural variability (e.g., ENSO, volcanic eruptions etc.). There must be a realistic persistence month-to-month in each station and geographically across nearby stations.
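To give a feeling for these requirements, here is a deliberately simple Python sketch. It is a toy, not the ISTI method: it assumes a cosine seasonal cycle, AR(1) noise for the month-to-month persistence, and a shared regional signal to make nearby stations correlated. All parameter values and function names are hypothetical.

```python
import math
import random

def ar1(n, phi, sigma, rng):
    """AR(1) noise: x[t] = phi * x[t-1] + e[t], e Gaussian with std. dev. sigma."""
    x, series = 0.0, []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, sigma)
        series.append(x)
    return series

def synthetic_stations(n_stations, n_months, seed=0):
    """Monthly series sharing a seasonal cycle and a regional signal."""
    rng = random.Random(seed)
    # Regional variability is shared by all stations (cross-correlation).
    regional = ar1(n_months, phi=0.3, sigma=1.0, rng=rng)
    stations = []
    for _ in range(n_stations):
        # Local variability is independent per station.
        local = ar1(n_months, phi=0.3, sigma=0.5, rng=rng)
        series = [
            10.0
            + 8.0 * math.cos(2 * math.pi * (m % 12) / 12)  # seasonal cycle
            + regional[m]
            + local[m]
            for m in range(n_months)
        ]
        stations.append(series)
    return stations

stations = synthetic_stations(n_stations=3, n_months=120)
print(len(stations), len(stations[0]))
```

A realistic generator would additionally have to reproduce features such as ENSO and volcanic eruptions and the observed correlation structure of the real network.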

Creation of realistic error models to add to the clean station data
The added inhomogeneities should cover all known types of inhomogeneity in terms of their frequency, magnitude and seasonal behaviour. For example, inhomogeneities could be any or a combination of the following:

- geographically or temporally clustered due to events which affect entire networks or regions (e.g. change in observation time);
- close to end points of time series;
- gradual or sudden;
- variance-altering;
- combined with the presence of a long-term background trend;
- small or large;
- frequent;
- seasonally or diurnally varying.
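Two of the items in the list above, sudden and gradual inhomogeneities, could be sketched in Python as follows. This is only an illustration under simple assumptions (additive offsets in °C, a linear drift); the function names are hypothetical and this is not the benchmark error model.

```python
def add_step_breaks(clean, breaks):
    """Return a copy of `clean` with sudden step changes added.

    `breaks` maps a start index to the offset (°C) applied from that
    index onward; offsets of successive breaks accumulate.
    """
    out = list(clean)
    for start, offset in breaks.items():
        for i in range(start, len(out)):
            out[i] += offset
    return out

def add_gradual_drift(clean, start, end, total):
    """Return a copy with a linear drift from `start`, reaching `total` at `end`.

    After `end` the offset stays constant at `total`.
    """
    out = list(clean)
    for i in range(start, len(out)):
        fraction = min(1.0, (i - start) / (end - start))
        out[i] += total * fraction
    return out

clean = [0.0] * 10
stepped = add_step_breaks(clean, {3: 0.5, 7: -0.2})
drifted = add_gradual_drift([0.0] * 6, start=1, end=5, total=1.0)
print(stepped)
print(drifted)
```

Because the break positions and sizes are chosen by whoever builds the benchmark, they are known exactly when the homogenised data is assessed.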

Design of an assessment system
Assessment of the homogenised benchmarks should be designed with the three purposes of benchmarking in mind. Both the ability to correctly locate changepoints and to adjust the data back to its homogeneous state are important. It can be split into four different levels:

- Level 1: The ability of the algorithm to restore an inhomogeneous world to its clean world state in terms of climatology, variance and trends.

- Level 2: The ability of the algorithm to accurately locate changepoints and detect their size/shape.

- Level 3: The strengths and weaknesses of an algorithm against specific types of inhomogeneity and observing system issues.

- Level 4: A comparison of the benchmarks with the real world in terms of detected inhomogeneity both to measure algorithm performance in the real world and to enable future improvement to the benchmarks.
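As an illustration of what Level 1 and Level 2 measures might look like, here is a small Python sketch: the error in the linear trend after homogenisation, and a hit rate and false-alarm rate for detected changepoints. The matching tolerance and the function names are assumptions made for this example, not the assessment defined in the paper.

```python
def linear_trend(series):
    """Ordinary least-squares slope per time step."""
    n = len(series)
    t_mean = (n - 1) / 2
    y_mean = sum(series) / n
    num = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
    den = sum((t - t_mean) ** 2 for t in range(n))
    return num / den

def trend_error(homogenised, clean):
    """Level 1 style measure: remaining trend error after homogenisation."""
    return linear_trend(homogenised) - linear_trend(clean)

def changepoint_scores(detected, true, tolerance=2):
    """Level 2 style measure: (hit rate, false-alarm rate).

    A true changepoint counts as a hit if a detected one lies within
    `tolerance` time steps; a detected changepoint is a false alarm if
    no true one lies within `tolerance` time steps.
    """
    hits = sum(any(abs(t - d) <= tolerance for d in detected) for t in true)
    false_alarms = sum(all(abs(d - t) > tolerance for t in true) for d in detected)
    hit_rate = hits / len(true) if true else 1.0
    false_alarm_rate = false_alarms / len(detected) if detected else 0.0
    return hit_rate, false_alarm_rate

print(linear_trend([0.0, 1.0, 2.0, 3.0]))
print(changepoint_scores(detected=[10, 50], true=[11, 30]))
```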

The benchmark cycle
This should all take place within a well laid out framework to encourage people to take part and make the results as useful as possible. Timing is important. Too long a cycle will mean that the benchmarks become outdated. Too short a cycle will reduce the number of groups able to participate.

Producing the clean synthetic station data on the global scale is a complicated task that has now taken several years, but we are close to completion of a version 1. We have collected together a list of known region-wide inhomogeneities and a comprehensive understanding of the many different types of inhomogeneities that can affect station data. We have also considered a number of assessment options and decided to focus on levels 1 and 2 for assessment within the benchmark cycle. Our benchmarking working group is aiming for release of the first benchmarks by January 2015.

Wednesday, 27 August 2014

A database with parallel climate measurements

By Renate Auchmann and Victor Venema

A parallel measurement with a Wild screen and a Stevenson screen in Basel, Switzerland. Double-Louvre Stevenson screens protect the thermometer well against influences of solar and heat radiation. The half-open Wild screens provide more ventilation, but were found to be affected too much by radiation errors. In Switzerland they were substituted by Stevenson screens in the 1960s.

We are building a database with parallel measurements to study non-climatic changes in the climate record. In a parallel measurement, two or more measurement set-ups are compared to each other at one location. Such data is analyzed to see how much a change from one set-up to another affects the climate record.

This post will first give a short overview of the problem, some first achievements and will then describe our proposal for a database structure. This post's main aim is to get some feedback on this structure.

Parallel measurements

Quite a lot of parallel measurements are performed; see this list for a first selection of datasets we found. However, they have often only been analyzed for a change in the mean. This is a pity because parallel measurements are especially important for studies on non-climatic changes in weather extremes and weather variability.

Studies on parallel measurements typically analyze single pairs of measurements, in the best cases a regional network is studied. However, the instruments used are often somewhat different in different networks and the influence of a certain change depends on the local weather and climate. Thus to draw solid conclusions about the influence of a specific change on large-scale (global) trends, we need large datasets with parallel measurements from many locations.

Studies on changes in the mean can be relatively easily compared with each other to get a big picture. But changes in the distribution can be analyzed in many different ways. To be able to compare changes found at different locations, the analysis needs to be performed in the same way. To facilitate this, gathering the parallel data in a large dataset is also beneficial.


Quite a number of people stand behind this initiative. The International Surface Temperature Initiative and the European Climate Assessment & Dataset have offered to host a copy of the parallel dataset. This ensures the long term storage of the dataset. The World Meteorological Organization (WMO) has requested its members to help build this databank and provide parallel datasets.

However, we do not have any funding. Last July, at the SAMSI meeting on the homogenization of the ISTI benchmark, people felt we can no longer wait for funding and it is really time to get going. Furthermore, Renate Auchmann offered to invest some of her time on the dataset; that doubles the manpower. Thus we have decided to simply start and see how far we can get this way.

The first activity was a one-page information leaflet with some background information on the dataset, which we will send to people when requesting data. The second activity is this blog post: a proposal for the structure of the dataset.

Upcoming tasks are the documentation of the directory and file formats, so that everyone can work with it. The data processing from level to level needs to be coded. The largest task is probably the handling of the metadata (data about the data). We will have to complete a specification for the metadata needed. A webform where people can enter this information would be great. (Does anyone have ideas for a good tool for such a webform?) And finally the dataset will have to be filled and analyzed.

Design considerations

Given the limited manpower, we would like to keep it as simple as possible at this stage. Thus data will be stored in text files and the hierarchical database will simply use a directory tree. Later on, a real database may be useful, especially to make it easier to select the parallel measurements one is interested in.

In addition to the parallel measurements, related measurements should also be stored. For example, to understand the differences between two temperature measurements, additional measurements (co-variates) on, for example, insolation, wind or cloud cover are important. Metadata also needs to be stored and should be machine readable as much as possible. Without meta-information on how the parallel measurement was performed, the data is not useful.

We are interested in parallel data from any source, variable and temporal resolution. High resolution (sub-daily) data is very important for understanding the reasons for any differences. There is probably more data, especially historical data, available for coarser resolutions and this data is important for studying non-climatic changes in the means.

However, we will scientifically focus on changes in the distribution of daily temperature and precipitation data in the climate record. Thus, we will compute daily averages from sub-daily data and will use these to compute the indices of the Expert Team on Climate Change Detection and Indices (ETCCDI), which are often used in studies on changes in “extreme” weather. Actively searching for data, we will prioritize instruments that were much used to perform climate measurements and early historical measurements, which are more rare and are expected to show larger changes.

Following the principles of the ISTI, we aim to be an open dataset with good provenance, that is, it should be possible to tell where the data comes from. For this reason, the dataset will have levels with increasing degrees of processing, so that one can go back to a more primitive level if one finds something interesting/suspicious.

For this same reason, the processing software will also be made available and we will try to use open software (especially the free programming language R, which is widely used in statistical climatology) as much as possible.

It will be an open dataset in the end, but as an incentive to contribute to the dataset, initially only contributors will be able to access the data. After joint publications, the dataset will be opened for academic research as a common resource for the climate sciences. In any case people using the data of a small number of sources are requested to explicitly cite them, so that contributing to the dataset also makes the value of making parallel measurements visible.

Database structure

The basic structure has 5 levels.

0: Original, raw data (e.g. images)
1: Native format data (as received)
2: Data in a standard format at original resolution
3: Daily data
4: ETCCDI indices

In levels 2, 3 & 4 we will provide information on outliers and inhomogeneities.

Especially for the study of extremes, the removal of outliers is important. Suggestions for good software that would work for all climate regions are welcome.

Longer parallel measurements may, furthermore, also contain inhomogeneities. We will not homogenize the data, because we want to study the raw data, but we will detect breaks and provide their date and size as metadata, so that the user can work on homogeneous subperiods if interested. This detection will probably be performed at monthly or annual scales with one of the HOME recommended methods.

Because parallel measurements will tend to be well correlated, it is possible that statistically significant inhomogeneities are very small and climatologically irrelevant. Thus we will also provide information on the size of the inhomogeneity so that the user can decide whether such a break is problematic for this specific application or whether having longer time series is more important.
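To illustrate the idea of reporting both the date and the size of a break, here is a minimal Python sketch that searches for a single break in the difference series of a parallel pair. It simply picks the split that maximises the between-segment sum of squares; it is not one of the HOME recommended methods and includes no significance test. The function name, the minimum segment length and the example values are hypothetical.

```python
from statistics import mean

def largest_single_break(diff, min_seg=3):
    """Find the split of `diff` that maximises the between-segment sum of squares.

    Returns (index, size): the break lies just before `index`, and
    `size` is mean(after) - mean(before).
    """
    best = None
    n = len(diff)
    for k in range(min_seg, n - min_seg + 1):
        before, after = diff[:k], diff[k:]
        size = mean(after) - mean(before)
        score = size * size * len(before) * len(after) / n
        if best is None or score > best[0]:
            best = (score, k, size)
    if best is None:
        raise ValueError("series too short for the chosen min_seg")
    return best[1], best[2]

# Hypothetical difference series (set-up A minus set-up B) with a 0.3 °C jump.
diff = [0.0] * 6 + [0.3] * 6
index, size = largest_single_break(diff)
print(index, size)
```

With the size stored as metadata, a user could, for example, ignore breaks smaller than some threshold that is irrelevant for the application at hand.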

Level 0 - images

If possible, we will also store the images of the raw data records. This enables the user to see if an outlier may be caused by unclear handwriting or whether the observer explicitly wrote that the weather was severe that day.

In case the normal measurements are already digitized, only the parallel one needs to be transcribed. In this case the number of values will be limited and we may be able to do so. Both Bern and Bonn have facilities to digitize climate data.

Level 1 – native format

Even if it will be more work for us, we would like to receive the data in its native format and will convert it ourselves to a common standard format. This will allow the users to see if mistakes were made in the conversion and allows for their correction.

Level 2 – standard format

In the beginning our standard format will be an ASCII format. Later on we may also use a scientific data format such as NetCDF. The format will be similar to the one of the COST Action HOME. Some changes to the filenames will be needed to account for multiple measurements of the same variable at one station and for multiple indices computed from the same variable.

Level 3 – daily data

We expect that an important use of the dataset will be the study of non-climatic changes in daily data. At this level we will thus gather the daily datasets and convert the sub-daily datasets to daily.
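As a rough illustration of the sub-daily to daily conversion, here is a minimal Python sketch. The record layout (a timestamp and a value), the function name to_daily and the use of a plain arithmetic mean of all available readings are assumptions made for this example; conventions for computing daily means differ between networks and nothing has been fixed for this dataset yet.

```python
from collections import defaultdict
from datetime import date, datetime


def to_daily(records):
    """Aggregate (timestamp, value) readings into per-day mean, min and max.

    Hypothetical helper: the daily mean is simply the average of all
    readings available on that calendar day.
    """
    by_day = defaultdict(list)
    for stamp, value in records:
        by_day[stamp.date()].append(value)
    return {
        day: {
            "mean": sum(values) / len(values),
            "min": min(values),
            "max": max(values),
            "n": len(values),  # number of readings, useful as a quality flag
        }
        for day, values in sorted(by_day.items())
    }


# Made-up sub-daily temperature readings in degrees Celsius.
readings = [
    (datetime(2014, 1, 1, 6), -2.0),
    (datetime(2014, 1, 1, 12), 3.0),
    (datetime(2014, 1, 1, 18), 2.0),
    (datetime(2014, 1, 2, 6), -1.0),
]
daily = to_daily(readings)
```

Keeping the number of readings per day next to the daily value makes it easy to filter out days that are based on too few observations.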

Level 4 – ETCCDI indices

Many people use the indices of the ETCCDI to study changes in extreme weather. Thus we will precompute these indices. Also, in case government policies do not allow giving out the daily data, it may sometimes be possible to obtain the indices. The same strategy is also used by the ETCCDI in regions where data availability is scarce and/or data accessibility is difficult.
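To give an idea of what precomputing an index involves, the sketch below counts two of the simplest ETCCDI indices per year: frost days (FD, daily minimum temperature below 0 °C) and summer days (SU, daily maximum temperature above 25 °C). The function names and the dictionary layout are assumptions for this example, not the project's software.

```python
from collections import Counter
from datetime import date


def count_days(daily, predicate):
    """Count, per calendar year, the days whose value satisfies predicate.

    Missing values (None) are skipped rather than counted.
    """
    counts = Counter()
    for day, value in daily.items():
        if value is not None and predicate(value):
            counts[day.year] += 1
    return dict(counts)


def frost_days(tmin):
    # FD: number of days with daily minimum temperature below 0 degC
    return count_days(tmin, lambda t: t < 0.0)


def summer_days(tmax):
    # SU: number of days with daily maximum temperature above 25 degC
    return count_days(tmax, lambda t: t > 25.0)


# Made-up daily values in degrees Celsius.
tmin = {
    date(2014, 1, 1): -3.2,
    date(2014, 1, 2): 0.0,
    date(2014, 1, 3): -0.1,
    date(2015, 1, 1): None,
}
tmax = {date(2014, 7, 1): 25.0, date(2014, 7, 2): 27.5}
```

Because the thresholds are strict inequalities, a day with exactly 0.0 °C or 25.0 °C is not counted.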

Directory structure

In the main directory there are the sub-directories: data, documentation, software and articles.

In the sub-directory data there are sub-directories for the data sources with names d###, where d stands for data source and ### is a running number of arbitrary length.

In these directories there are up to 5 sub-directories with the levels and one directory with “additional” metadata such as photos and maps that cannot be copied in every level.

In the level 0 and level 1 directories, the climate data, the flag files and the machine readable metadata are stored directly in the directory.

Because one data source can contain more than one station, in the levels 2 and higher there are sub-directories for the various stations. These sub-directories will be called s###; with s for station.
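The layout described so far can be sketched as a small path helper. Only the d### and s### prefixes come from the plan above; the level directory names (level0 to level4), the zero padding to three digits and the function names are assumptions for illustration.

```python
from pathlib import PurePosixPath

# The "data" sub-directory of the main directory.
DATA_ROOT = PurePosixPath("data")


def source_dir(source_id):
    """Directory of one data source, e.g. data/d007 (padding is an assumption)."""
    return DATA_ROOT / f"d{source_id:03d}"


def level_dir(source_id, level, station_id=None):
    """Directory holding the files of one level of one data source.

    Levels 0 and 1 keep their files directly in the level directory;
    levels 2 and higher have one sub-directory per station (s###).
    The name "level<N>" is hypothetical.
    """
    if level not in range(5):
        raise ValueError("level must be 0, 1, 2, 3 or 4")
    base = source_dir(source_id) / f"level{level}"
    if level < 2:
        return base
    if station_id is None:
        raise ValueError("levels 2 and higher need a station number")
    return base / f"s{station_id:03d}"


native_dir = level_dir(7, 1)
daily_dir = level_dir(7, 3, station_id=12)
```

With these assumptions, native_dir is data/d007/level1 and daily_dir is data/d007/level3/s012.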

Once we have more data and until we have a real database, we may also provide a directory structure first ordered by the 5 levels.

The filenames will contain information on the station and variable. In the root directory we will provide machine readable tables detailing which variables can be found in which directories, so that people interested in a certain variable know which directories to read.

For the metadata we are currently considering using XML, which can be read into R. (Are there similar packages for Matlab and FORTRAN?) Suggestions for other options are welcome.

What do you think? Is this a workable structure for such a dataset? Suggestions welcome in the comments or also by mail (Victor Venema & Renate Auchmann).

Related reading

A database with daily climate data for more reliable studies of changes in extreme weather
The previous post provides more background on this project.
CHARMe: Sharing knowledge about climate data
An EU project to improve the meta information and therewith make climate data more easily usable.
List of Parallel climate measurements
Our Wiki page listing a large number of resources with parallel data.
Future research in homogenisation of climate data – EMS 2012 in Poland
A discussion on homogenisation at a Side Meeting at EMS2012
What is a change in extreme weather?
Two possible definitions, one for impact studies, one for understanding.
HUME: Homogenisation, Uncertainty Measures and Extreme weather
Proposal for future research in homogenisation of climate network data.
Homogenization of monthly and annual data from surface stations
A short description of the causes of inhomogeneities in climate data (non-climatic variability) and how to remove it using the relative homogenization approach.
New article: Benchmarking homogenization algorithms for monthly data
Raw climate records contain changes due to non-climatic factors, such as relocations of stations or changes in instrumentation. This post introduces an article that tested how well such non-climatic factors can be removed.

Friday, 27 June 2014

Self-review of problems with the HOME validation study for homogenization methods

In my last post, I argued that post-publication review is no substitute for pre-publication review, but it could be a nice addition.

This post is a post-publication self-review, a review of our paper on the validation of statistical homogenization methods, also called benchmarking when it is a community effort. Since writing this benchmarking article we have understood the problem better and have found some weaknesses. I have explained these problems at conferences, but for the people who did not hear them, please find them below after a short introduction. We have a new paper in open review that explains how we want to do better in the next benchmarking study.

Benchmarking homogenization methods

In our benchmarking paper we generated a dataset that mimicked real temperature or precipitation data. To this data we added non-climatic changes (inhomogeneities). We requested the climatologists to homogenize this data, to remove the inhomogeneities we had inserted. How good the homogenization algorithms are can be seen by comparing the homogenized data to the original homogeneous data.

This is straightforward science, but the realism of the dataset was the best to date and because this project was part of a large research program (the COST Action HOME) we had a large number of contributions. Mathematical understanding of the algorithms is also important, but homogenization algorithms are complicated methods and it is also possible to make errors in the implementation, thus such numerical validations are also valuable. Both approaches complement each other.

Group photo at a meeting of the COST Action HOME with most of the European homogenization community present. These are those people working in ivory towers, eating caviar from silver plates, drinking 1985 Romanee-Conti Grand Cru from crystal glasses and living in mansions. Enjoying the good life on the public teat, while conspiring against humanity.

The main conclusions were that homogenization improves the homogeneity of temperature data. Precipitation is more difficult and only the best algorithms were able to improve it. We found that modern methods improved the quality of temperature data about twice as much as traditional methods. It is thus important that people switch to one of these modern methods. My impression from the recent Homogenisation seminar and the upcoming European Meteorological Society (EMS) meeting is that this seems to be happening.

1. Missing homogenization methods

An impressive number of methods participated in HOME. Also many manual methods were applied, which are validated less because this is more work. All the state-of-the-art methods participated and most of the much used methods. However, we forgot to test a two- or multi-phase regression method, which is popular in North America.

Also not validated is HOMER, the algorithm that was designed afterwards using the best parts of the tested algorithms. We are working on this. Many people have started using HOMER. Its validation should thus be a high priority for the community.

2. Size breaks (random walk or noise)

Next to the benchmark data with the inserted inhomogeneities, we also asked people to homogenize some real datasets. This turned out to be very important because it allowed us to validate how realistic the benchmark data is, information we need to make future studies more realistic. In this validation we found that the benchmark inhomogeneities were larger than those in the real data. Expressed as the standard deviation of the break size distribution, the benchmark breaks were typically 0.8°C and the real breaks were only 0.6°C.

This was already reported in the paper, but we now understand why. In the benchmark, the inhomogeneities were implemented by drawing a random number for every homogeneous period and perturbing the original data by this amount. In other words, we added noise to the homogeneous data. However, the homogenizers who requested breaks with a size of about 0.8°C were thinking of the difference from one homogeneous period to the next. The size of such breaks is influenced by two random numbers. Because variances are additive, this means that the jumps implemented as noise were the square root of two (about 1.4) times too large.
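The factor of the square root of two can be written out in a few lines. The symbols are introduced here for illustration only: x_i is the perturbation of homogeneous period i, drawn independently with standard deviation sigma, and b_i is the jump between two consecutive periods.

```latex
\begin{align}
  b_i &= x_{i+1} - x_i \\
  \operatorname{Var}(b_i) &= \operatorname{Var}(x_{i+1}) + \operatorname{Var}(x_i) = 2\sigma^2 \\
  \sigma_b &= \sqrt{2}\,\sigma \approx 1.41\,\sigma
\end{align}
```

So if the perturbations per period are drawn with a standard deviation of 0.8°C, the jumps between periods have a standard deviation of about 1.13°C.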

The validation showed that, except for the size, the idea of implementing the inhomogeneities as noise was a good approximation. The alternative would be to draw a random number and use that to perturb the data relative to the previously perturbed period. In that case you implement the inhomogeneities as a random walk. Nobody thought of reporting it, but it seems that most validation studies have implemented their inhomogeneities as random walks. This makes the influence of the inhomogeneities on the trend much larger. Because of the larger error, it is probably easier to achieve relative improvements, but because the initial errors were absolutely larger, the absolute errors after homogenization may well have been too large in previous studies.

You can see the difference between a noise perturbation and a random walk by comparing the sign (up or down) of the breaks from one break to the next. For example, in case of noise and a large upward jump, the next change is likely to make the perturbation smaller again. In case of a random walk, the size and sign of the previous break are irrelevant. The probability of either sign is one half.

In other words, in case of a random walk there are just as many up-down and down-up pairs as there are up-up and down-down pairs; every combination has a chance of one in four. In case of noise perturbations, up-down and down-up pairs (platform-like break pairs) are more likely than up-up and down-down pairs. The latter is what we found in the real datasets. There is a small deviation that suggests a small random walk contribution, but that may also be because the inhomogeneities cause a trend bias.
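The difference between the two models is easy to check numerically. The sketch below is a toy simulation, not the HOME benchmark code; the function names, the Gaussian distribution and the value of 0.8 for the standard deviation are assumptions. For independent noise levels, two consecutive jumps have opposite signs whenever the middle level is the highest or lowest of three consecutive levels, which happens with probability two thirds. For a random walk the probability is one half.

```python
import math
import random


def jumps_from_noise(n_breaks, sigma, rng):
    """Noise model: every homogeneous period gets its own independent level."""
    levels = [rng.gauss(0.0, sigma) for _ in range(n_breaks + 1)]
    return [later - earlier for earlier, later in zip(levels, levels[1:])]


def jumps_from_random_walk(n_breaks, sigma, rng):
    """Random walk model: every jump is drawn independently of the previous one."""
    return [rng.gauss(0.0, sigma) for _ in range(n_breaks)]


def platform_fraction(jumps):
    """Fraction of consecutive jump pairs with opposite signs (up-down or down-up)."""
    pairs = list(zip(jumps, jumps[1:]))
    return sum(1 for a, b in pairs if a * b < 0) / len(pairs)


def rms(values):
    """Root mean square size of the jumps (their mean is zero by construction)."""
    return math.sqrt(sum(v * v for v in values) / len(values))


rng = random.Random(2014)  # fixed seed so the run is reproducible
noise = jumps_from_noise(100_000, 0.8, rng)
walk = jumps_from_random_walk(100_000, 0.8, rng)

noise_fraction = platform_fraction(noise)  # expected to be close to 2/3
walk_fraction = platform_fraction(walk)    # expected to be close to 1/2
size_ratio = rms(noise) / rms(walk)        # expected to be close to sqrt(2)
```

The same platform_fraction function can be applied to the breaks detected in a real dataset to see which of the two models fits better.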

3. Signal to noise ratio varies regionally

The HOME benchmark reproduced a typical situation in Europe (the USA is similar). However, the station density in much of the world is lower. Inhomogeneities are detected and corrected by comparing a candidate station to neighbouring ones. When the station density is lower, this difference signal is noisier and this makes homogenization more difficult. Thus one would expect that the performance of homogenization methods is lower in other regions. However, the break frequency and break size may also be different.
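The idea of the difference signal can be illustrated with a deliberately idealised example. The function name and the data are made up, and the neighbours are combined with a simple unweighted mean, which is an assumption of this sketch rather than the approach of any particular homogenization method. In this toy case the neighbours share the regional signal exactly, so the break stands out perfectly; with fewer or more distant neighbours the difference series would contain more weather noise.

```python
def difference_series(candidate, neighbours):
    """Candidate minus the unweighted mean of its neighbours, value by value."""
    n = len(candidate)
    if not neighbours or any(len(series) != n for series in neighbours):
        raise ValueError("all series must have the same length")
    return [
        value - sum(series[i] for series in neighbours) / len(neighbours)
        for i, value in enumerate(candidate)
    ]


# Made-up regional climate signal shared by all stations (degrees Celsius).
regional = [0.3, -0.5, 1.1, 0.2, -0.8, 0.6]

# Two neighbours with constant station offsets that cancel in the mean.
neighbours = [[r + 0.5 for r in regional], [r - 0.5 for r in regional]]

# The candidate has a break of 1.0 degC after the third value.
candidate = [r + (1.0 if i >= 3 else 0.0) for i, r in enumerate(regional)]

diff = difference_series(candidate, neighbours)
```

In the raw candidate series the break is hidden in the year-to-year variability; in diff the regional signal has cancelled and only the step of 1.0°C remains.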

Thus to estimate how large the influence of the remaining inhomogeneities can be on the global mean temperature, we need to study the performance of homogenization algorithms in a wider range of situations. Also for the intercomparison of homogenization methods (the more limited aim of HOME) the signal (break size) to noise ratio is important. Domonkos (2013) showed that the ranking of various algorithms depends on the signal to noise ratio. Ralf Lindau and I have just submitted a manuscript that shows that for low signal to noise ratios, the multiple breakpoint method PRODIGE is not much better in detecting breaks than a method that would "detect" random breaks, while it works fine for higher signal to noise ratios. Other methods may also be affected, but possibly not to the same extent. More on that later.

4. Regional trends (absolute homogenization)

The initially simulated data did not have a trend, thus we explicitly added a trend to all stations to give the data a regional climate change signal. This trend could be either upward or downward, just to check whether homogenization methods might have problems with downward trends, which are not typical of daily operations. They do not.

Had we inserted a simple linear trend in the HOME benchmark data, the operators of the manual homogenization could have theoretically used this information to improve their performance: if the trend of the homogenized data is not linear, there are apparently still inhomogeneities in the data. We wanted to keep the operators blind. Consequently, we inserted a rather complicated and variable nonlinear trend in the dataset.

As already noted in the paper, this may have handicapped the participating absolute homogenization method. Homogenization methods used in climate are normally relative ones. These methods compare a station to its neighbours; both have the same regional climate signal, which is thus removed and not important. Absolute methods do not use the information from the neighbours; these methods have to make assumptions about the variability of the real regional climate signal. Absolute methods have problems with gradual inhomogeneities and are less sensitive and are therefore not used much.

If absolute methods are participating in future studies, the trend should be modelled more realistically. When benchmarking only automatic homogenization methods (no operator) an easier trend should be no problem.

5. Length of the series

The station networks simulated in HOME were all one century long; some of the series were shorter because we also simulated the build up of the network during the first 25 years. We recently found that the criterion for the optimal number of break inhomogeneities used by one of the best homogenization methods (PRODIGE) does not have the right dependence on the number of data points (Lindau and Venema, 2013). For climate datasets that are about a century long, the criterion is quite good, but for much longer or shorter datasets there are deviations. This illustrates that the length of the datasets is also important and that it is important for benchmarking that the data availability is the same as in real datasets.

Another reason why it is important for the benchmark data availability to be the same as in the real dataset is that this makes the comparison of the inhomogeneities found in the real data and in the benchmark more straightforward. This comparison is important to make future validation studies more accurate.

6. Non-climatic trend bias

The inhomogeneities we inserted in HOME were on average zero. For the stations this still results in clear non-climatic trend errors because you only average over a small number of inhomogeneities. For the full networks the number of inhomogeneities is larger and the non-climatic trend error thus very small. It was consequently very hard for the homogenization methods to improve these small errors. It is expected that in real raw datasets there is a larger non-climatic error. Globally the non-climatic trend will be relatively small, but within one network, where the stations experienced similar (technological and organisational) changes, it can be appreciable. Thus we should model such a non-climatic trend bias explicitly in future.
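Why the network error becomes so small, and why a shared bias does not, can be made explicit with a simple model. The notation is introduced here for illustration: e_k is the non-climatic trend error of station k in a network of N stations, written as a common bias beta plus a station-specific part epsilon_k, and the station-specific parts are assumed to be independent with zero mean and variance sigma_e squared.

```latex
\begin{align}
  e_k &= \beta + \varepsilon_k, \qquad
  \bar{e} = \frac{1}{N}\sum_{k=1}^{N} e_k \\
  \operatorname{E}[\bar{e}] &= \beta, \qquad
  \operatorname{Var}(\bar{e}) = \frac{\sigma_e^2}{N}
\end{align}
```

In HOME the bias was zero, so the network mean error was only the small random part that shrinks with the number of stations. A bias shared by the stations of a network does not average out, however many stations there are.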

International Surface Temperature Initiative

The last five problems will be solved in the International Surface Temperature Initiative (ISTI) benchmark. Whether a two-phase homogenization method will participate is beyond our control. We do expect fewer participants than in HOME because for such a huge global dataset, the homogenization methods will need to be able to run automatically and unsupervised.

The standard break sizes will be made smaller. We will make ten benchmarking "worlds" with different kinds of inserted inhomogeneities and will also vary the size and number of the inhomogeneities. Because the ISTI benchmarks will mirror the real data holdings of the ISTI, the station density and the length of the data will be the same. The regional climate signal will be derived from a global circulation model and absolute methods could thus participate. Finally, we will introduce a clear non-climatic trend bias to several of the benchmark "worlds".

The paper on the ISTI benchmark is open for discussions at the journal Geoscientific Instrumentation, Methods and Data Systems. Please find the abstract below.

The International Surface Temperature Initiative (ISTI) is striving towards substantively improving our ability to robustly understand historical land surface air temperature change at all scales. A key recently completed first step has been collating all available records into a comprehensive open access, traceable and version-controlled databank. The crucial next step is to maximise the value of the collated data through a robust international framework of benchmarking and assessment for product intercomparison and uncertainty estimation. We focus on uncertainties arising from the presence of inhomogeneities in monthly surface temperature data and the varied methodological choices made by various groups in building homogeneous temperature products. The central facet of the benchmarking process is the creation of global scale synthetic analogs to the real-world database where both the "true" series and inhomogeneities are known (a luxury the real world data do not afford us). Hence algorithmic strengths and weaknesses can be meaningfully quantified and conditional inferences made about the real-world climate system. Here we discuss the necessary framework for developing an international homogenisation benchmarking system on the global scale for monthly mean temperatures. The value of this framework is critically dependent upon the number of groups taking part and so we strongly advocate involvement in the benchmarking exercise from as many data analyst groups as possible to make the best use of this substantial effort.

Related reading

Nick Stokes made a beautiful visualization of the raw temperature data in the ISTI database. Homogenized data where non-climatic trends have been removed is unfortunately not yet available; that will be released together with the results of the benchmark.

New article: Benchmarking homogenisation algorithms for monthly data. The post describing the HOME benchmarking article.

New article on the multiple breakpoint problem in homogenization. Most work in statistics is about data with just one break inhomogeneity (change point). In climate there are typically more breaks. Methods designed for multiple breakpoints are more accurate.

Part 1 of a series on Five statistically interesting problems in homogenization.


Domonkos, P., 2013: Efficiencies of Inhomogeneity-Detection Algorithms: Comparison of Different Detection Methods and Efficiency Measures. Journal of Climatology, Art. ID 390945, doi: 10.1155/2013/390945.

Lindau and Venema, 2013: On the multiple breakpoint problem and the number of significant breaks in homogenization of climate records. Idojaras, Quarterly Journal of the Hungarian Meteorological Service, 117, No. 1, pp. 1-34. See also my post: New article on the multiple breakpoint problem in homogenization.

Lindau and Venema, to be submitted, 2014: The joint influence of break and noise variance on the break detection capability in time series homogenization.

Willett, K., Williams, C., Jolliffe, I., Lund, R., Alexander, L., Brönnimann, S., Vincent, L. A., Easterbrook, S., Venema, V., Berry, D., Warren, R., Lopardo, G., Auchmann, R., Aguilar, E., Menne, M., Gallagher, C., Hausfather, Z., Thorarinsdottir, T., and Thorne, P. W.: Concepts for benchmarking of homogenisation algorithm performance on the global scale, Geosci. Instrum. Method. Data Syst. Discuss., 4, 235-270, doi: 10.5194/gid-4-235-2014, 2014.

Tuesday, 26 November 2013

Are break inhomogeneities a random walk or a noise?

Tomorrow is the next conference call of the benchmarking and assessment working group (BAWG) of the International Surface Temperature Initiative (ISTI; Thorne et al., 2011). The BAWG will create a dataset to benchmark (validate) homogenization algorithms. It will mimic the real mean temperature data of the ISTI, but will include known inhomogeneities, so that we can assess how well the homogenization algorithms remove them. We are almost finished discussing how the benchmark dataset should be developed, but still need to fix some details, such as the question: Are break inhomogeneities a random walk or a noise?

Previous studies

The benchmark dataset of the ISTI will be global and is also intended to be used to estimate uncertainties in the climate signal due to remaining inhomogeneities. These are the two main improvements over previous validation studies.

Williams, Menne, and Thorne (2012) validated the pairwise homogenization algorithm of NOAA on a dataset mimicking the US Historical Climate Network. The paper focusses on how well large-scale biases can be removed.

The COST Action HOME has performed a benchmarking of several small networks (5 to 19 stations) realistically mimicking European climate networks (Venema et al., 2012). Its main aim was to intercompare homogenization algorithms; the small networks allowed HOME to also test manual homogenization methods.

These two studies were blind, in other words the scientists homogenizing the data did not know where the inhomogeneities were. An interesting coincidence is that the people who generated the blind benchmarking data were outsiders at the time: Peter Thorne for NOAA and me for HOME. This probably explains why we both made an error, which we should not repeat in the ISTI.