Showing posts with label benchmarking. Show all posts

Friday, January 22, 2021

New paper: Spanish and German climatologists on how to remove errors from observed climate trends

This picture shows three meteorological shelters next to each other in Murcia (Spain). The rightmost shelter is a replica of the Montsouri (French) screen, in use in Spain and many European countries in the late 19th and early 20th century. On the left is a Stevenson screen equipped with conventional meteorological instruments, a set-up used globally for most of the 20th century. In the middle is a Stevenson screen equipped with automatic sensors. The Montsouri screen is better ventilated, but because some solar radiation can reach the thermometer it registers somewhat higher temperatures than a Stevenson screen. Picture: Project SCREEN, Center for Climate Change, Universitat Rovira i Virgili, Spain.

The instrumental climate record is human cultural heritage, the product of the diligent work of many generations of people all over the world. But changes in the way temperature was measured and in the surroundings of weather stations can produce spurious trends. An international team, with participation of the University Rovira i Virgili (Spain), the State Meteorological Agency (AEMET, Spain) and the University of Bonn (Germany), has made a major effort to provide reliable tests for the methods used to computationally eliminate such spurious trends. These so-called “homogenization methods” are a key step in turning the enormous effort of the observers into accurate climate change data products. The results have been published in the prestigious Journal of Climate of the American Meteorological Society. The research was funded by the Spanish Ministry of Economy and Competitiveness.

Climate observations often go back more than a century, to times before we had electricity or cars. Such long time spans make it virtually impossible to keep the measurement conditions the same across time. The best-known problem is the growth of cities around urban weather stations. Cities tend to be warmer, for example due to reduced evaporation by plants or because high buildings block cooling. This can be seen comparing urban stations with surrounding rural stations. It is less talked about, but there are similar problems due to the spread of irrigation.

The most common reason for jumps in the observed data is the relocation of weather stations. Volunteer observers tend to make observations near their homes; when they retire and a new volunteer takes over the task, this can produce temperature jumps. Even for professional observations, keeping the locations the same over centuries can be a challenge, either because urban growth makes sites unsuitable or because organizational changes lead to new premises. Climatologist Dr. Victor Venema from Bonn, one of the authors: “A quite typical organizational change is that weather offices that used to be in cities were transferred to newly built airports, which need observations and predictions. The weather station in Bonn used to be on a field in the village Poppelsdorf, which is now a quarter of Bonn, and after several relocations the station is currently at the airport Cologne-Bonn.”

For global trends, the most important changes are technological changes of the same kinds and with similar effects all over the world. Now we are, for instance, in a period with widespread automation of the observational networks.

Appropriate computer programs for the automatic homogenization of climatic time series are the result of several years of development work. They work by comparing nearby stations with each other and looking for changes that only happen in one of them, as opposed to climatic changes that influence all stations.

To scrutinize these homogenization methods the research team created a dataset that closely mimics observed climate datasets including the mentioned spurious changes. In this way, the spurious changes are known and one can study how well they are removed by homogenization. Compared to previous studies, the testing datasets showed much more diversity; real station networks also show a lot of diversity due to differences in their management. The researchers especially took care to produce networks with widely varying station densities; in a dense network it is easier to see a small spurious change in a station. The test dataset was larger than ever containing 1900 station networks, which allowed the scientists to accurately determine the differences between the top automatic homogenization methods that have been developed by research groups from Europe and the Americas. Because of the large size of the testing dataset, only automatic homogenization methods could be tested.

The international author group found that it is much more difficult to improve the accuracy of the network-mean climate signals than the accuracy of individual station time series.

The Spanish homogenization methods excelled. The method developed at the Centre for Climate Change, Univ. Rovira i Virgili, Vila-seca, Spain, by Hungarian climatologist Dr. Peter Domonkos was found to be the best at homogenizing both individual station series and regional network mean series. The method of the State Meteorological Agency (AEMET), Unit of Islas Baleares, Palma, Spain, developed by Dr. José A. Guijarro was a close second.

When it comes to removing systematic trend errors from many networks, and especially from networks where similar spurious changes happen in many stations at similar dates, the homogenization method of the US National Oceanic and Atmospheric Administration (NOAA) performed best. This is a method that was designed to homogenize station datasets at the global scale, where the main concern is the reliable estimation of global trends.

The open screen formerly used at the station Uccle in Belgium, with two modern closed Stevenson thermometer screens with double-louvred walls in the background.

Quotes from participating researchers

Dr. Peter Domonkos, who earlier was a weather observer and now writes a book about time series homogenization: “This study has shown the value of large testing datasets and demonstrates another reason why automatic homogenization methods are important: they can be tested much better, which aids their development.”

Prof. Dr. Manola Brunet, who is the director of the Centre for Climate Change, Univ. Rovira i Virgili, Vila-seca, Spain, Visiting Fellow at the Climatic Research Unit, University of East Anglia, Norwich, UK, and Vice-President of the World Meteorological Services Technical Commission, said: “The study showed how important dense station networks are to make homogenization methods powerful and thus to compute accurate observed trends. Unfortunately, a lot of climate data still needs to be digitized to contribute to an even better homogenization and quality control.”

Dr. Javier Sigró from the Centre for Climate Change, Univ. Rovira i Virgili, Vila-seca, Spain: “Homogenization is often a first step that allows us to go into the archives and find out what happened to the observations that produced the spurious jumps. Better homogenization methods mean that we can do this in a much more targeted way.”

Dr. José A. Guijarro: “Not only may the results of the project help users choose the method best suited to their needs; they also helped developers improve their software by showing its strengths and weaknesses, and will allow further improvements in the future.”

Dr. Victor Venema: “In a previous similar study we found that homogenization methods that were designed to handle difficult cases, where a station has multiple spurious jumps, were clearly better. Interestingly, this study did not find this. It may be that it is more a matter of methods being carefully fine-tuned and tested.”

Dr. Peter Domonkos: “The accuracy of homogenization methods will likely improve further; however, we should never forget that spatially dense and high-quality climate observations are the most important pillar of our knowledge about climate change and climate variability.”

Press releases

Spanish weather service, AEMET: Un equipo internacional de climatólogos estudia cómo minimizar errores en las tendencias climáticas observadas

URV university in Tarragona, Catalan: Un equip internacional de climatòlegs estudia com es poden minimitzar errades en les tendències climàtiques observades

URV university, Spanish: Un equipo internacional de climatólogos estudia cómo se pueden minimizar errores en las tendencias climáticas observadas

URV university, English: An international team of climatologists is studying how to minimise errors in observed climate trends

Articles

Tarragona 21: Climatòlegs de la URV estudien com es poden minimitzar errades en les tendències climàtiques observades

Genius Science, French: Une équipe de climatologues étudie comment minimiser les erreurs dans la tendance climatique observée

Phys.org: A team of climatologists is studying how to minimize errors in observed climate trend

 

Friday, May 1, 2020

Statistical homogenization under-corrects any station network-wide trend biases

Photo of a station of the US Climate Reference Network with a prominent wind shield for the rain gauges.
A station of the US Climate Reference Network.


In the last blog post I made the argument that the statistical detection of breaks in climate station data has problems when the noise is larger than the break signal. The post before argued that the best homogenization correction method we have can remove network-wide trend biases perfectly if all breaks are known. In the light of the last post, we naturally would like to know how well this correction method can remove such biases in the more realistic case when the breaks are imperfectly estimated. That should still be studied much better, but it is interesting to discuss a number of other studies on the removal of network-wide trend biases from the perspective of this new understanding.

So this post will argue that it theoretically makes sense that (unavoidable) inaccuracies of break detection lead to network-wide trend biases only being partially corrected by statistical homogenization.

1) We have seen this in our study of the correction method in response to small errors in the break positions (Lindau and Venema, 2018).

2) The benchmarking study of NOAA’s homogenization algorithm shows that if the breaks are big and easy they are largely removed, while in the scenario where breaks are plentiful and small half of the trend bias remains (Williams et al., 2012).

3) Another benchmarking study shows that with the network density of Switzerland homogenization can find and remove clear trend biases, while if you thin this network to be similar to Peru the bias cannot be removed (Gubler et al., 2017).

4) Finally, a benchmarking study of relative humidity station observations in Austria could not remove much of the trend bias, which is likely because relative humidity is not correlated well from station to station (Chimani et al., 2018).

Statistical homogenization on a global scale makes warming estimates larger (Lawrimore et al., 2011; Menne et al., 2018). Thus if it can only remove part of any trend bias, this would mean that quite likely the actual warming was larger.


Figure 1: The inserted versus remaining network-mean trend error. Upper panel for perfect breaks. Lower panel for a small perturbation of the break positions. The time series are 100 annual values and have 5 breaks. Figure 10 in Lindau and Venema (2018).

Joint correction method

First, what did our study on the correction method (Lindau and Venema, 2018) say about the importance of errors in the break position? As the paper was mostly about perfect breaks, we assumed that all breaks were known, but that they had a small error in their position. In the example to the right, we perturbed the break position by a normally distributed random number with standard deviation one (lower panel), while for comparison the breaks are perfect (upper panel).

In both cases we inserted a large network-wide trend bias of 0.873 °C over the length of the century-long time series. The inserted errors for 1000 simulations are on the x-axis; the average inserted trend bias is denoted by x̅. The remaining error after homogenization is on the y-axis. Its average is denoted by y̅ and is basically zero in case the breaks are perfect (top panel). In case of the small perturbation (lower panel) the average remaining error is 0.093 °C, which is 11 % of the inserted trend bias. That is the under-correction for a quite small perturbation: 38 % of the positions are not changed at all.

If the standard deviation of the position perturbation is increased to 2, the remaining trend bias is 21 % of the inserted bias.

In the upper panel, there is basically no correlation between the inserted and the remaining error. That is, the remaining error does not depend on the break signal, but only on the noise. In the lower panel with the position errors, there is a correlation between the inserted and remaining trend error. So in this more realistic case, it does matter how large the trend bias due to the inhomogeneities is.

This is naturally an idealized case, position errors will be more complicated in reality and there would be spurious and missing breaks. But this idealized case fitted best to the aim of the paper of studying the correction algorithm in isolation.

It helps to understand where the problem lies. The correction algorithm is basically a regression that aims to explain the inserted break signal (and the regional climate signal). Errors in the predictors will lead to an explained variance that is less than 100 %. One should thus expect the estimated break signal to be smaller than the actual break signal. The trend change produced by the estimated break signal is consequently expected to be smaller than the actual trend change due to the inhomogeneities.
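To make this attenuation argument concrete, here is a minimal Python sketch. It is only an illustration of the statistical effect, not the joint correction scheme of the paper; the break size, noise level and size of the position error are assumed values chosen for clarity.

```python
import numpy as np

rng = np.random.default_rng(42)
n_years, n_sim = 100, 2000
true_size = 1.0      # assumed size of the single break (°C)
noise_sd = 0.5       # assumed noise level of the difference series (°C)

ratios = []
for _ in range(n_sim):
    true_pos = int(rng.integers(20, 80))
    true_step = np.where(np.arange(n_years) >= true_pos, true_size, 0.0)
    series = true_step + rng.normal(0, noise_sd, n_years)

    # the break position used in the correction is slightly wrong (sd = 2 years)
    est_pos = int(np.clip(true_pos + np.rint(rng.normal(0, 2)), 1, n_years - 1))
    est_step = np.where(np.arange(n_years) >= est_pos, 1.0, 0.0)

    # least-squares estimate of the break size using the imperfect position
    beta = np.cov(series, est_step, ddof=0)[0, 1] / np.var(est_step)
    ratios.append(beta / true_size)

print(f"mean estimated / true break size: {np.mean(ratios):.2f}")
# below 1: the estimated break signal is systematically too small, so the
# trend change it corrects for is also too small (under-correction)
```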

NOAA’s benchmark

That statistical homogenization under-corrects when the going gets tough is also found by the benchmarking study of NOAA’s Pairwise Homogenization Algorithm in Williams et al. (2012). They simulated temperature networks like the American USHCN network and added inhomogeneities according to a range of scenarios. (Also with various climate change signals.) Some scenarios were relatively easy, had few and large breaks, while others were hard and contained many small breaks. The easy cases were corrected nearly perfectly with respect to the network-wide trend, while in the hard cases only half of the inserted network-wide trend error was removed.

The results of this benchmarking for the three scenarios with a network-wide trend bias are shown below. The three panels are for the three scenarios. Each panel has results (the crosses, ignore the box plots) for three periods over which the trend error was computed. The main message is that the homogenized data (orange crosses) lies between the inhomogeneous data (red crosses) and the homogeneous data (green crosses). Put differently, green is how much the climate actually changed, red is how much the estimate is wrong due to inhomogeneities, orange shows that homogenization moves the estimate towards the truth, but never fully gets there.

If we use the number of breaks and their average size as a proxy for the difficulty of the scenario, the one on the left has 6.4 breaks with an average size of 0.8 °C, the one in the middle 8.4 breaks (size 0.4 °C) and the one on the right 10 breaks (size 0.4 °C). So this suggests a clear dose-effect relationship, although there is surely more to the difficulty than just the number of breaks.


Figures from Williams et al. (2012) showing the results for three scenarios. This is a figure I created from parts of Figure 7 (left), Figure 5 (middle) and Figure 10 (right; their numbers).

When this study appeared in 2012, I found the scenario with the many small breaks much too pessimistic. However, our recent study estimating the properties of the inhomogeneities of the American network found a surprisingly large number of breaks: more than 17 per century; they were bigger: 0.5 °C. So purely based on the number of breaks the hardest scenario is even optimistic, but also size matters.

Not that I would already like to claim that even in a dense network like the American there is a large remaining trend bias and the actual warming was much larger. There is more to the difficulty of inhomogeneities than their number and size. It sure is worth studying.

Alpine benchmarks

The other two examples in the literature I know of are examples of under-correction in the sense of basically no correction because the problem is simply too hard. Gubler et al. (2017) shows that the raw data of the Swiss temperature network has a clear trend bias, which can be corrected with homogenization of its dense network (together with metadata), but when they thin the network to a network density similar to that of Peru, they are unable to correct this trend bias. For more details see my review of this article in the Grassroots Review Journal on Homogenization.

Finally, Chimani et al. (2018) study the homogenization of daily relative humidity observations in Austria. I made a beautiful daily benchmark dataset, which was a lot of fun: on a daily scale you have autocorrelations and a distribution with an upper and lower limit, which need to be respected by the homogeneous data and the inhomogeneous data. But already the normal homogenization of monthly averages was much too hard.

Austria has quite a dense network, but relative humidity is much influenced by very local circumstances and does not correlate well from station to station. My co-authors from the Austrian weather service wanted to write about the improvements: "an improvement of the data by homogenization was non‐ideal for all methods used". For me the interesting finding was: nearly no improvement was possible. That was unexpected. Had we expected that, we could have generated a much simpler monthly or annual benchmark to show that no real improvement was possible for humidity data, and saved ourselves a lot of (fun) work.

What does this mean for global warming estimates?

When statistical homogenization only partially removes large-scale trend biases what does this mean for global warming estimates? In the global temperature datasets statistical homogenization leads to larger warming estimates. So if we tend to underestimate how much correction is needed, this would mean that the Earth most likely warmed up more than current estimates indicate. How much exactly is hard to tell at the moment and thus needs a nuanced discussion. Let me give you my considerations in the next post.


Other posts in this series

Part 5: Statistical homogenization under-corrects any station network-wide trend biases

Part 4: Break detection is deceptive when the noise is larger than the break signal

Part 3: Correcting inhomogeneities when all breaks are perfectly known

Part 2: Trend errors in raw temperature station data due to inhomogeneities

Part 1: Estimating the statistical properties of inhomogeneities without homogenization

References

Chimani Barbara, Victor Venema, Annermarie Lexer, Konrad Andre, Ingeborg Auer and Johanna Nemec, 2018: Inter-comparison of methods to homogenize daily relative humidity. International Journal of Climatology, 38, pp. 3106–3122. https://doi.org/10.1002/joc.5488.

Gubler, Stefanie, Stefan Hunziker, Michael Begert, Mischa Croci-Maspoli, Thomas Konzelmann, Stefan Brönnimann, Cornelia Schwierz, Clara Oria and Gabriela Rosas, 2017: The influence of station density on climate data homogenization. International Journal of Climatology, 37, pp. 4670–4683. https://doi.org/10.1002/joc.5114

Lawrimore, Jay H., Matthew J. Menne, Byron E. Gleason, Claude N. Williams, David B. Wuertz, Russell S. Vose and Jared Rennie, 2011: An overview of the Global Historical Climatology Network monthly mean temperature data set, version 3. Journal of Geophysical Research, 116, D19121. https://doi.org/10.1029/2011JD016187

Lindau, Ralf and Victor Venema, 2018: On the reduction of trend errors by the ANOVA joint correction scheme used in homogenization of climate station records. International Journal of Climatology, 38, pp. 5255–5271. Manuscript: https://eartharxiv.org/r57vf/, paywalled article: https://doi.org/10.1002/joc.5728

Menne, Matthew J., Claude N. Williams, Byron E. Gleason, Jared J. Rennie and Jay H. Lawrimore, 2018: The Global Historical Climatology Network Monthly Temperature Dataset, Version 4. Journal of Climate, 31, 9835–9854.
https://doi.org/10.1175/JCLI-D-18-0094.1

Williams, Claude, Matthew Menne and Peter Thorne, 2012: Benchmarking the performance of pairwise homogenization of surface temperatures in the United States. Journal of Geophysical Research, 117, D05116. https://doi.org/10.1029/2011JD016761

Monday, February 24, 2020

Estimating the statistical properties of inhomogeneities without homogenization

One way to study inhomogeneities is to homogenize a dataset and study the corrections made. However, that way you only study the inhomogeneities that have been detected. Furthermore, it is always nice to have independent lines of evidence in an observational science. So in this recently published study Ralf Lindau and I (2019) set out to study the statistical properties of inhomogeneities directly from the raw data.

Break frequency and break size

The description of inhomogeneities can be quite complicated.

Observational data contains both break inhomogeneities (jumps due to, for example, a change of instrument or location) and gradual inhomogeneities (for example, due to degradation of the sensor or the instrument screen, growing vegetation or urbanization). The first simplification we make is that we only consider break inhomogeneities. Gradual inhomogeneities are typically homogenized with multiple breaks and they are often quite hard to distinguish from actual multiple breaks in case of noisy data.

When it comes to the year and month of the break we assume every date has the same probability of containing a break. It could be that when there is a break, it is more likely that there is another break, or less likely that there is another break.* It could be that some periods have a higher probability of having a break or the beginning of a series could have a different probability or when there is a break in station X, there could be a larger chance of a break in station Y. However, while some of these possibilities make intuitively sense, we do not know about studies on them, so we assume the simplest case of independent breaks. The frequency of these breaks is a parameter our method will estimate.

* When you study the statistical properties of breaks detected by homogenization methods, you can see that around a break it is less likely for there to be another break. One reason for this is that some homogenization methods explicitly exclude the possibility of two nearby breaks. The methods that do allow for nearby breaks will still often prefer the simpler solution of one big break over two smaller ones.


When it comes to the sizes of the breaks we are reasonably confident that they follow a normal distribution. Our colleagues Menne and Williams (2005) computed the break sizes for all dates where the station history suggested something happened to the measurement that could affect its homogeneity.** They found the break size distribution plotted below. The graph compares the histogram to a normal distribution with an average of zero. Apart from the actual distribution not having a mean of zero (leading to trend biases) it seems to be a decent match and our method will assume that break sizes have a normal distribution.


Figure 1. Histogram of break sizes for breaks known from station histories (metadata).


** When you study the statistical properties of breaks detected by homogenization methods the distribution looks different; the graph plotted below is a typical example. You will not see many small breaks; the middle of the normal distribution is missing. This is because these small breaks are not statistically significant in a noisy time series. Furthermore, you often see some really large breaks. These are likely multiple breaks being detected as one big one. Using breaks known from the metadata, as Menne and Williams (2005) did, avoids or reduces these problems and is thus a better estimate of the distribution of actual breaks in climate data. Although, you can always worry that the breaks not known in the metadata are different. Science never ends.



Figure 2. Histogram of detected break sizes for the lower USA.
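A small simulation may help to see why the middle of the distribution goes missing. This is only a sketch: the break-size distribution, the detection noise and the significance threshold are all assumed values, not those of any real detection method.

```python
import numpy as np

rng = np.random.default_rng(5)
true_sizes = rng.normal(0.0, 0.5, 100_000)  # assumed: actual breaks ~ N(0, 0.5 °C)
noise_se = 0.2                              # assumed standard error of an estimated jump

# a break counts as "detected" only if its estimated size is statistically significant
estimated = true_sizes + rng.normal(0, noise_se, true_sizes.size)
detected = estimated[np.abs(estimated) > 2 * noise_se]

print(f"fraction of breaks detected: {detected.size / true_sizes.size:.2f}")
hist, edges = np.histogram(detected, bins=np.arange(-2, 2.01, 0.2))
for count, left_edge in zip(hist, edges):   # crude text histogram of detected sizes
    print(f"{left_edge:+.1f}  {'#' * (60 * count // hist.max())}")
# the middle of the histogram is empty: small breaks are not significant and go undetected
```

The detected fraction printed here depends entirely on the assumed numbers; the point is only the shape of the histogram.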

Temporal behavior

The break frequency and size is still not a complete description of the break signal, there is also the temporal dependence of the inhomogeneities. In the HOME benchmark I had assumed that every period between two breaks had a shift up or down determined by a random number, what we call “Random Deviation from a baseline” in the new article. To be honest, “assumed” means I had not really thought about it when generating the data. In the same year, NOAA published a benchmark study where they assumed that the jumps up and down (and not the levels) were given by a random number, that is, they assumed the break signal is a random walk. So we have to distinguish between levels and jumps.

This makes quite a difference for the trend errors. In case of Random Deviations, if the first jump goes up it is more likely that the next jump goes down, especially if the first jump goes up a lot. In case of a random walk or Brownian Motion, when the first jump goes up, this does not influence the next jump and it has a 50% probability of also going up. Brownian Motion hence has a tendency to run away, when you insert more breaks, the variance of the break signal keeps going up on average, while Random Deviations are bounded.
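For those who prefer code over words, here is a minimal sketch of the two statistical models. The number of breaks and the jump variance are assumed values for illustration, not the settings of the papers.

```python
import numpy as np

rng = np.random.default_rng(0)
n_years, n_breaks, jump_sd, n_sim = 100, 10, 1.0, 5000   # assumed values

def break_signal(kind):
    """One simulated break signal: a step function with n_breaks breaks."""
    positions = np.sort(rng.choice(np.arange(1, n_years), n_breaks, replace=False))
    if kind == "RD":   # Random Deviations: the levels are independent random numbers
        # levels drawn with sd jump_sd/sqrt(2) so that the jumps have variance jump_sd**2
        levels = rng.normal(0, jump_sd / np.sqrt(2), n_breaks + 1)
    else:              # Brownian Motion: the jumps are independent, the levels accumulate
        levels = np.concatenate(([0.0], np.cumsum(rng.normal(0, jump_sd, n_breaks))))
    return levels[np.searchsorted(positions, np.arange(n_years), side="right")]

for kind in ("RD", "BM"):
    trends = [np.polyfit(np.arange(n_years), break_signal(kind), 1)[0] * n_years
              for _ in range(n_sim)]
    print(kind, "variance of the century trends:",
          round(float(np.var(trends)), 2), "°C² per century²")
# with this many breaks the Brownian Motion signal produces a much larger trend error
```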

The figure from another new paper (Lindau and Venema, 2020) shown below quantifies the big difference this makes for the trend error of a typical 100-year-long time series. On the x-axis you see the frequency of the breaks (in breaks per century) and on the y-axis the variance of the trends (in K² or °C² per century²) these breaks produce.

The plus-symbols are for the case of Random Deviations from a baseline. If you have exactly two breaks per time series this gives the largest trend error. However, because the number of breaks varies, an average break frequency of about three breaks per series gives the largest trend error. This makes sense as no breaks would give no trend error, while in case of more and more breaks you average over more and more independent numbers and the trend error becomes smaller and smaller.

The circle-symbols are for Brownian Motion. Here the variance of the trends increases linearly with the number of breaks. For a typical number of breaks of more than five, Brownian Motion produces a much larger trend error than Random Deviations.


Figure 3. Figure from Lindau and Venema (2020) quantifying the trend errors due to break inhomogeneities. The variance of the jump sizes is the same in both cases: 1 °C².

One of our colleagues, Peter Domonkos, also sometimes uses Brownian Motion, but puts a limit on how far it can run away. Furthermore, he is known for the concept of platform-like inhomogeneity pairs, where if the first break goes up, the next one is more likely to go down (or the other way around) thus building a platform.

All of these statistical models can make physical sense. When a measurement error causes the observation to go up (or down), once this problem is discovered it will go down (or up) again, thus creating a platform inhomogeneity pair. When the first break goes up (or down) because of a relocation, this perturbation remains when the sensor is changed and both remain when the screen is changed, thus creating a random walk. Relocations are a frequent reason for inhomogeneities. When the station Bonn is relocated, the operator will want to keep it in the region, thus searching in a random direction around Bonn, rather than around the previous location. That would create Random Deviations.

In the benchmarking study HOME we looked at the sign of consecutive detected breaks (Venema et al., 2012). In case of Random Deviations, like HOME used for its simulated breaks, you would expect to get platform break pairs (first break up and the second down, or reversed) in 4 of 6 cases (67%). We detected them in 63% of the cases, a bit less, probably showing that platform pairs are a bit harder to detect than two breaks going in the same direction. In case of Brownian Motion you would expect 50% platform break pairs. For the real data in the HOME benchmark the percentage of platforms was 59%. So this does not fit to Brownian Motion, but is lower than you would expect from Random Deviations. Reality seems to be somewhere in the middle.
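The expected percentages are easy to check with a small Monte Carlo sketch (purely illustrative): for Random Deviations, consecutive jumps are differences of independent levels and thus anti-correlated; for Brownian Motion they are independent.

```python
import numpy as np

rng = np.random.default_rng(1)
n_sim = 200_000

# Random Deviations: the two consecutive jumps are differences of three independent levels
levels = rng.normal(size=(n_sim, 3))
rd_jumps = np.diff(levels, axis=1)

# Brownian Motion: the two consecutive jumps are independent
bm_jumps = rng.normal(size=(n_sim, 2))

for name, jumps in (("Random Deviations", rd_jumps), ("Brownian Motion", bm_jumps)):
    platform = np.mean(np.sign(jumps[:, 0]) != np.sign(jumps[:, 1]))
    print(f"{name}: fraction of platform pairs ≈ {platform:.3f}")
# prints roughly 0.667 for Random Deviations and 0.500 for Brownian Motion
```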

So for our new study estimating the statistical properties of inhomogeneities we opted for a statistical model where the breaks are described by a Random Deviations (RD) signal added to a Brownian Motion (BM) signal and estimate their parameters to see how large these two components are.

The observations

To estimate the properties of the inhomogeneities we have monthly temperature data from a large number of stations. This data has a regional climate signal, observational and weather noise and inhomogeneities. To separate the noise and the inhomogeneities we can use the fact that they are very different with respect to their temporal correlations. The noise will be mostly independent in time or weakly correlated in as far as measurement errors depend on the weather. The inhomogeneities, on the other hand, have correlations over many years.

However, the regional climate signal also has correlations over many years and is comparable in size to the break signal. So we have opted to work with a difference time series, that is, subtracting the time series of a neighboring station from that of a candidate station. This mostly removes the complicated climate signal and what remains is two times the inhomogeneities and two times the noise. The map below shows the 1459 station pairs we used for the USA.


Figure 4. Map of the lower USA with all the pairs of stations we used in this study.

For estimating the inhomogeneities, the climate signal is noise. By removing it we reduce the noise level and avoid having to make assumptions about the regional climate signal. There are also disadvantages to working with the difference series: inhomogeneities that are in both the candidate and the reference series will be (partially) removed. For example, when there is a jump because of a change in the way the temperature is computed, this leads to a change in the entire network***. Such a jump would be mostly invisible in a difference series, although not fully invisible because the jump size will be different at every station.


*** In the past the temperature was read multiple times a day or a minimum and maximum temperature thermometer was used. With labor-saving automatic weather stations we can now sample the temperature many times a day and changing from one definition to another will give a jump.
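As a tiny illustration of such a definition change, here is a sketch with a purely made-up, asymmetric diurnal cycle; the numbers are not meant to be realistic, only to show that the two definitions of the daily mean differ.

```python
import numpy as np

hours = np.arange(24)
# a stylized, asymmetric diurnal cycle (purely an assumption for illustration)
theta = 2 * np.pi * (hours - 9) / 24
temp = 15 + 5 * np.sin(theta) + 1.5 * np.cos(2 * theta)

mean_hourly = temp.mean()                      # AWS-style mean of many samples per day
mean_minmax = (temp.min() + temp.max()) / 2    # classical (Tmin + Tmax) / 2 definition

print(f"mean of 24 hourly values: {mean_hourly:.2f} °C")
print(f"(Tmin + Tmax) / 2:        {mean_minmax:.2f} °C")
# the two definitions give different values, so switching between them produces a jump
```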

Spatiotemporal differences

As test statistic we have chosen the variance of the spatiotemporal differences. The “spatio” part of the differences I already explained, we use the difference between two stations. Temporal differences mean we subtract two numbers separated by a time lag. For all pairs of stations and all possible pairs of values with a certain lag, we compute the variance of all these difference values and do this for lags of zero to 80 years.

In the paper we do all the math to show how the three components (noise, Random Deviation and Brownian Motion) depend on the lag. The noise does not depend on the lag. It is constant. Brownian Motion produces a linear increase of the variance as a function of lag, while the Random Deviations produce a saturating exponential function. How fast the function saturates is a function of the number of breaks per century.
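A rough simulation sketch of this lag dependence is shown below. The noise level, break sizes and break frequencies are assumed values, so only the shapes matter, not the numbers; the real estimation in the paper is a proper fit of the derived functions, not this toy.

```python
import numpy as np

rng = np.random.default_rng(3)
n_years, n_series = 100, 2000
noise_sd, rd_level_sd, bm_jump_sd = 0.8, 0.5, 0.3   # assumed values
rd_rate = 1 / 6    # assumed: about one Random-Deviation break every 6 years
bm_rate = 1 / 20   # assumed: Brownian-Motion breaks every 20 years on average

def one_difference_series():
    """A simulated candidate-minus-neighbour series: noise + RD breaks + BM breaks."""
    x = rng.normal(0, noise_sd, n_years)
    # Random Deviations: a new independent level after each RD break
    segment = np.cumsum(rng.random(n_years) < rd_rate)
    x += rng.normal(0, rd_level_sd, segment[-1] + 1)[segment]
    # Brownian Motion: accumulate independent jumps at the BM break dates
    jumps = np.where(rng.random(n_years) < bm_rate,
                     rng.normal(0, bm_jump_sd, n_years), 0.0)
    return x + np.cumsum(jumps)

series = np.array([one_difference_series() for _ in range(n_series)])
for lag in (1, 5, 10, 20, 40, 80):
    diffs = series[:, lag:] - series[:, :-lag]
    print(f"lag {lag:2d}: variance of the differences = {diffs.var():.2f} K²")
# the noise gives a constant offset, the RD breaks a saturating increase with lag,
# and the BM breaks an additional, roughly linear increase
```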

The variance of the spatiotemporal differences for America is shown below. The O-symbols are the variances computed from the data. The other lines are the fits for the various parts of the statistical model. The variance of the noise is about 0.62 K² (or °C²) and is shown as a horizontal line as it does not depend on the lag. The component of the Brownian Motion is the line indicated by BM, while the Random Deviation (RD) component is the curve starting at the origin and growing to about 0.47 K². From how fast this curve grows we estimate that the American data has one RD break every 5.8 years.

The curve for Brownian Motion being a line already suggests that it is not possible to estimate how many BM breaks the time series contains; we only know the total variance, not whether it comes from many small breaks or one big one.



Figure 5. The variance of the spatiotemporal differences as a function of the time lag for the lower USA.

The situation for Germany is a bit different; see figure below. Here we do not see the continual linear increase in the variance we had above for America. Apparently the break signal in Germany does not have a significant Brownian Motion component and only contains Random Deviation breaks. The number of breaks is also much smaller, the German data only has one break every 24 years. The German weather service seems to give undisturbed climate observations a high priority.

For both countries the size of the RD breaks is about the same and quite small, expressed as typical jump size it would be about 0.5°C.



Figure 6. The variance of the spatiotemporal differences as a function of the time lag L for Germany.

The number of detected breaks

The number of breaks we found for America is a lot larger than the number of breaks detected by statistical homogenization. Typical numbers for detected breaks are one per 15 years for America and one per 20 years for Europe, although it also depends considerably on the homogenization method applied.

I was surprised by the large difference between actual breaks and detected breaks, I thought we would maybe miss 20 to 25% of the breaks. If you look at the histograms of the detected breaks, such as Figure 2 reprinted below, where the middle is missing, it looks as if about 20% is missing in a country with a dense observational network.

But these histograms are not a good way to determine what is missing. Next to the influence of chance, small breaks may be detected because they have a good reference station and other breaks are far away, while relatively big breaks may go undetected because of other nearby breaks. So there is not a clear cut-off and you would have to go far from the middle to find reliably detected breaks, which is where you get into the region where there are too many large breaks because detection algorithms combined two or more breaks into one. In other words, it is hard to estimate how many breaks are missing by fitting a normal distribution to the histogram of the detected breaks.

If you do the math, as we do in Section 6 of the article, it is perfectly possible not to detect half of the breaks even for a dense observational network.


Figure 2. Histogram of detected break sizes for the lower USA.

Final thoughts

This is a new methodology, so let's see how it holds up when others look at it, with new methods, other assumptions about the nature of inhomogeneities and other datasets. Separating Random Deviations and Brownian Motion requires long series. We do not have that many long series and you can already see in the figures above that the variance of the spatiotemporal differences for Germany is quite noisy. The method thus requires too much data to apply it to networks all over the world.

In Lindau and Venema (2018) we introduced a method to estimate the break variance and the number of breaks for a single pair of stations (but not BM vs RD). This needed some human inspection to ensure the fits were right, but it does suggest that there may be a middle ground, a new method which can estimate these parameters for smaller amounts of data, which can be applied world wide.

The next blog post will be about the trend errors due to these inhomogeneities. If you have any questions about our work, do leave a comment below.


Other posts in this series

Part 5: Statistical homogenization under-corrects any station network-wide trend biases

Part 4: Break detection is deceptive when the noise is larger than the break signal

Part 3: Correcting inhomogeneities when all breaks are perfectly known

Part 2: Trend errors in raw temperature station data due to inhomogeneities

Part 1: Estimating the statistical properties of inhomogeneities without homogenization

References

Lindau, R. and Venema, V., 2020: Random trend errors in climate station data due to inhomogeneities. International Journal of Climatology, 40, pp. 2393–2402. Open Access. https://doi.org/10.1002/joc.6340

Lindau, R. and Venema, V., 2019: A new method to study inhomogeneities in climate records: Brownian motion or random deviations? International Journal of Climatology, 39, pp. 4769–4783. Manuscript: https://eartharxiv.org/vjnbd/ https://doi.org/10.1002/joc.6105

Lindau, R. and Venema, V.K.C., 2018: The joint influence of break and noise variance on the break detection capability in time series homogenization. Advances in Statistical Climatology, Meteorology and Oceanography, 4, p. 1–18. https://doi.org/10.5194/ascmo-4-1-2018

Menne, M.J. and C.N. Williams, 2005: Detection of Undocumented Changepoints Using Multiple Test Statistics and Composite Reference Series. Journal of Climate, 18, 4271–4286. https://doi.org/10.1175/JCLI3524.1

Menne, M.J., C.N. Williams, and R.S. Vose, 2009: The U.S. Historical Climatology Network Monthly Temperature Data, Version 2. Bulletin American Meteorological Society, 90, 993–1008. https://doi.org/10.1175/2008BAMS2613.1

Venema, V., O. Mestre, E. Aguilar, I. Auer, J.A. Guijarro, P. Domonkos, G. Vertacnik, T. Szentimrey, P. Stepanek, P. Zahradnicek, J. Viarre, G. Müller-Westermeier, M. Lakatos, C.N. Williams, M.J. Menne, R. Lindau, D. Rasol, E. Rustemeier, K. Kolokythas, T. Marinova, L. Andresen, F. Acquaotta, S. Fratianni, S. Cheval, M. Klancar, M. Brunetti, Ch. Gruber, M. Prohom Duran, T. Likso, P. Esteban, Th. Brandsma, 2012: Benchmarking homogenization algorithms for monthly data. Climate of the Past, 8, pp. 89-115. https://doi.org/10.5194/cp-8-89-2012

Monday, April 27, 2015

Two new reviews of the homogenization methods used to remove non-climatic changes

By coincidence, this week two initiatives have been launched to review the methods to remove non-climatic changes from temperature data. One initiative was launched by the Global Warming Policy Foundation (GWPF), a UK free-market think tank. The other by the Task Team on Homogenization (TT-HOM) of the Commission for Climatology (CCl) of the World Meteorological Organization (WMO). Disclosure: I chair the TT-HOM.

The WMO is one of the oldest international organizations and has the meteorological and hydrological services of almost all countries of the world as its members. The international exchange of weather data has always been important for understanding the weather and for making weather predictions. The main role of the WMO is to provide guidance and to define standards that make collaboration easier. The CCl coordinates climate research, especially when it comes to data measured by national weather services.

The review on homogenization, which the TT-HOM will write, is thus mainly aimed at helping national weather services produce better quality datasets to study climate change. This will allow weather services to provide better climate services to help their nations adapt to climate change.

Homogenization

Homogenization is necessary because much has happened in the world since the French and industrial revolutions: two world wars, the rise and fall of communism, and the start of the internet age. Inevitably many changes have occurred in climate monitoring practices. Many global datasets start in 1880, the year toilet paper was invented in the USA and three decades before the Ford Model T.

As a consequence, the instruments used to measure temperature have changed, the screens that protect the sensors from the weather have changed, and the surroundings of the stations have often changed, with stations being moved in response. These non-climatic changes in temperature have to be removed as well as possible to make more accurate assessments of how much the world has warmed.

Removing such non-climatic changes is called homogenization. For the land surface temperature measured at meteorological stations, homogenization is normally performed using relative statistical homogenizing methods. Here a station is compared to its neighbours. If the neighbour is sufficiently nearby, both stations should show about the same climatic changes. Strong jumps or gradual increases happening at only one of the stations indicate a non-climatic change.
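To make the idea concrete, here is a toy Python sketch (purely illustrative, with made-up numbers; not any operational homogenization method): a jump inserted into one of two otherwise similar stations stands out clearly in their difference series.

```python
import numpy as np

rng = np.random.default_rng(7)
years = np.arange(1951, 2021)
# a shared regional signal: a modest warming trend plus year-to-year weather (assumed values)
regional_climate = 0.015 * (years - years[0]) + rng.normal(0, 0.4, years.size)

candidate = regional_climate + rng.normal(0, 0.3, years.size)
reference = regional_climate + rng.normal(0, 0.3, years.size)
candidate[years >= 1990] += 0.8        # inserted non-climatic jump, e.g. a relocation

diff = candidate - reference           # the shared climate signal largely cancels here

# simplest possible break search: the split that maximizes the difference between the
# means of the two sub-periods (a toy stand-in for real detection tests such as SNHT)
scores = [abs(diff[:k].mean() - diff[k:].mean()) for k in range(5, diff.size - 5)]
best = int(np.argmax(scores)) + 5
print("most likely break year:", years[best])
print(f"estimated jump size: {diff[best:].mean() - diff[:best].mean():.2f} °C")
```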

If there is a bias in the trend, statistical homogenization can reduce it. How well trend biases can be removed depends on the density of the network. In industrialised countries a large part of the bias can be removed for the last century. In developing countries and in earlier times removing biases is more difficult and a large part may remain. Because many governments unfortunately limit the exchange of climate data, the global temperature collections can also remove only part of the trend biases.

Some differences

There are some subtle differences. The Policy Foundation has six people from the UK, Canada and the USA, who do not work on homogenization. The WMO team has nine people who work on homogenization, from Congo, Pakistan, Peru, Canada, the USA, Australia, Hungary, Germany, and Spain.

The TT-HOM team has simply started outlining its report. The Policy Foundation creates spin before they have results, with publications in their newspapers and blogs, and they showcase that they are biased to begin with when they write [on their homepage]:
But only when the full picture is in will it be possible to see just how far the scare over global warming has been driven by manipulation of figures accepted as reliable by the politicians who shape our energy policy, and much else besides. If the panel’s findings eventually confirm what we have seen so far, this really will be the “smoking gun”, in a scandal the scale and significance of which for all of us can scarcely be exaggerated.
My emphasis. Talk about hyperbole by the click-whore journalists of the Policy Foundation. Why buy newspapers when their articles are worse than a random page on the internet? The Policy Foundation gave their team a very bad start.

Aims of the Policy Foundation

Hopefully, the six team members of the Policy Foundation will realise just how naive and loaded the questions they were supposed to answer are. The WMO has asked us whether we as TT-HOM would like to update our Terms of Reference; we are the experts after all. I hope the review team will update theirs, as that would help them to be seen as scientists seriously interested in improving science. Their current terms of reference are printed in italics below.

The panel is asked to examine the preparation of data for the main surface temperature records: HadCRUT, GISS, NOAA and BEST. For this reason the satellite records are beyond the scope of this inquiry.

I fail to see the Policy Foundation asking something without arguments as a reason.

The satellite record is the most adjusted record of them all. The raw satellite data does not show much trend at all and initially even showed a cooling trend. Much of the warming in this uncertain and short dataset is thus introduced the moment the researchers remove the non-climatic changes (differences between satellites, drifts in their orbits and the height of the satellites, for example). A relatively small error in these adjustments thus quickly leads to large trend errors.

While independent studies for the satellite record are sorely missing, a blind validation study for station data showed that homogenization methods work. They reduce any temperature trend biases a dataset may have for reasonable scenarios. For this blind validation study we produced a dataset that mimics a real climate network with known non-climatic changes, so that we knew what the answer should be. We have a similar blind validation of the method used by NOAA to homogenize its global land surface data.

The following questions will be addressed.

1. Are there aspects of surface temperature measurement procedures that potentially impair data quality or introduce bias and need to be critically re-examined?

Yes. A well-known aspect is the warming bias due to urbanization. This has been much studied and was found to produce only a small warming bias. A likely reason is that urban stations are regularly relocated to less urban locations.

On the other hand, the reasons for a cooling bias in land temperatures have been studied much too little. In a recent series, I mention several reasons why current measurements are cooler than those in the past: changes in thermometer screens, relocations and irrigation. At this time we cannot tell how important each of these individual reasons is. Any of these reasons is potentially important enough to explain the 0.2°C per century cooling trend bias found in the GHCNv3 land temperatures. The reasons mentioned above could together explain a much larger cooling trend bias, which could dramatically change our assessment of the progress of global warming.

2. How widespread is the practice of adjusting original temperature records? What fraction of modern temperature data, as presented by HadCRUT/GISS/NOAA/BEST, are actual original measurements, and what fraction are subject to adjustments?

Or as Nick Stokes put it "How widespread is the practice of doing arithmetic?" (hat tip HotWhopper.)

Almost all longer station measurement series contain non-climatic changes. There is about one abrupt non-climatic change every 15 to 20 years. I know of two long series that are thought to be homogeneous: Potsdam in Germany and Mohonk Lake, New York, USA. There may be a few more. If you know more please write a comment below.

It is pretty amazing that the Policy Foundation knows so little about climate data that it asked its team to answer such a question. A question everyone working on the topic could have answered. A question that makes most sense when seen as an attempt to deceive the public and insinuate that there are problems.

3. Are warming and cooling adjustments equally prevalent?

Naturally not.

If we were sure that warming and cooling adjustments were of the same size, there would be no need to remove non-climatic changes from climate data before computing a global mean temperature signal.

It is known in the scientific literature that the land temperatures are adjusted upwards and the ocean temperatures are adjusted downwards.

It is pretty amazing that the Policy Foundation knows so little about climate data that it asked its team to answer such a question. A question everyone working on the topic could have answered. A question that makes most sense when seen as an attempt to deceive the public and insinuate that there are problems.

4. Are there any regions of the world where modifications appear to account for most or all of the apparent warming of recent decades?

The adjustments necessary for the USA land temperatures happen to be large, about 0.4°C.

That is explained by two major transitions: a change in the time of observation from afternoons to mornings (about 0.2°C) and the introduction of automatic weather stations (AWS), which in the USA happens to have produced a cooling bias of 0.2°C. (The bias due to the introduction of AWS depends on the design of the AWS and the local climate and thus differs a lot from network to network.)

The smaller the meteorological network or region you consider, the larger the biases you can find. Many of them average out on a global scale.

5. Are the adjustment procedures clearly documented, objective, reproducible and scientifically defensible? How much statistical uncertainty is introduced with each step in homogeneity adjustments and smoothing?

The adjustments to the global datasets are objective and reproducible. These datasets are so large that there is no option other than processing them automatically.

The GHCN raw land temperatures are published, the processing software is published, and everyone can repeat it. The same goes for BEST and GISS. Clearly documented and defensible are matters of opinion and this can always be improved. But if the Policy Foundation is not willing to read the scientific literature, clear documentation does not help much.

Statistical homogenization reduces the uncertainty of large-scale trends. Another loaded question.

Announcement full of bias and errors

Also the article [by Christopher Booker in the Telegraph and reposted by the Policy Foundation] announcing the review by the Policy Foundation is full of errors.

Booker: The figures from the US National Oceanic and Atmospheric Administration (NOAA) were based, like all the other three official surface temperature records on which the world’s scientists and politicians rely, on data compiled from a network of weather stations by NOAA’s Global Historical Climate Network (GHCN).

No, the Climate Research Unit and BEST gather data themselves. They do also use GHCN land surface data, but would certainly notice if that data showed more or less global warming than their other data sources.

Also the data published by national weather services show warming. If someone assumes a conspiracy, it would be a very large one. Real conspiracies tend to be small and short.

Booker: But here there is a puzzle. These temperature records are not the only ones with official status. The other two, Remote Sensing Systems (RSS) and the University of Alabama (UAH), are based on a quite different method of measuring temperature data, by satellites. And these, as they have increasingly done in recent years, give a strikingly different picture.

The long-term trend is basically the same. The satellites see much stronger variability due to El Niño, which makes them better suited for cherry-picking short periods, if one is so inclined.

Booker: In particular, they will be wanting to establish a full and accurate picture of just how much of the published record has been adjusted in a way which gives the impression that temperatures have been rising faster and further than was indicated by the raw measured data.

None of the studies using the global mean temperature will match this criterion because contrary to WUWT-wisdom the adjustments reduce the temperature trend, which gives the "impression" that temperatures have been rising more slowly and less than was indicated by the raw measured data.

The homepage of the Policy Foundation team shows a graph for the USA (in Fahrenheit), reprinted below. This is an enormous cherry pick. The adjustments necessary for the USA land temperatures happen to be large and warming, about 0.4°C. The reasons for this were explained above in the answer to GWPF question 4.



That the US non-climatic changes are large relative to other regions should be known to somewhat knowledgeable people. Presented without context on the homepage of the Policy Foundation and The Telegraph, it will fool the casual reader by suggesting that this is typical.

[UPDATE. I have missed one rookie mistake. Independent expert Zeke Hausfather says: It's a bad sign that this new effort features one graph on their website: USHCN version 1 adjusted minus raw. Unfortunately, USHCN v1 was replaced by USHCN v2 (with the automated PHA rather than manual adjustments) about 8 years ago. The fact that they are highlighting an old out-of-date adjustment graph is, shall we say, not a good sign.]

For the global mean temperature, the net effect of all adjustments is a reduction in the warming. The raw records show a stronger warming due to non-climatic changes, which climatologists reduce by homogenization.

Thus what really happens is the opposite of what happens to the USA land temperatures shown by the Policy Foundation. They do not show this because it does not fit their narrative of activist scientists, but this is the relevant temperature record with which to assess the magnitude of global warming and thus the relevant adjustment.



Previous reviews

I am not expecting serious journalists to write about this. [UPDATE. Okay, I was wrong about that.] Maybe later, when the Policy Foundation shows their results and journalists can ask independent experts for feedback. However, just in case, here is an overview of real work to ascertain the quality of the station temperature trend.

In a blind validation study we showed that homogenization methods reduce any temperature trend biases for reasonable scenarios. For this blind validation study we produced a dataset that mimics a real climate network. Into this data we inserted known non-climatic changes, so that we knew what the answer should be and could judge how well the algorithms work. It is certainly possible to make a scenario in which the algorithms would not work, but to the best of our understanding such scenarios would be very unrealistic.

We have a similar blind validation of the method used by NOAA to homogenize its global land surface data.

The International Surface Temperature Initiative (ISTI) has collected a large dataset with temperature observations. It is now working on a global blind validation dataset, with which we will not only be able to say that homogenization methods improve trend estimates, but also to get a better numerical estimate of by how much. (In more data sparse regions in developing countries, the methods probably cannot improve the trend estimate much, the previous studies were for Europe and the USA).

Then we have BEST by physicist Richard Muller and his group of non-climatologists who started working on the quality of station data. They basically found the same result as the mainstream climatologists. This group actually put in work and developed an independent method to estimate the climatic trends, rather than just do a review. The homogenization method from this group was also applied to the NOAA blind validation dataset and produced similar results.

We have the review and guidance of the World Meteorological Organization on homogenization from 2003. The review of the Task Team on Homogenization will be an update of this classic report.

Research priorities

The TT-HOM has decided to focus on monthly mean data used to establish global warming. Being a volunteer effort we do not have the resources to tackle the more difficult topic of changes to extremes in detail. If someone has some money to spare, that is where I would do a review. That is a seriously difficult topic where we do not know well how accurately we can remove non-climatic problems.

And as mentioned above, a good review of the satellite microwave temperature data would be very valuable. Satellite data is affected by strong non-climatic changes and almost its entire trend is due to homogenization adjustments; a relatively small error in the adjustments thus quickly leads to large changes in their trend estimates. At the same time I do not know of a (blind) validation study nor of an estimate of the uncertainty in satellite temperature trends.

If someone has some money to spare, I hope it is someone interested in science, no matter the outcome, and not a Policy Foundation with an obvious stealth agenda, clearly interested in a certain outcome. It is good that we have science foundations and universities to fund most of the research; funders who are interested in the quality of the research rather than the outcome.

The interest is appreciated. Homogenization is too much of a blind spot in climate science. As Neville Nicholls, one of the heroes of the homogenization community, writes:
When this work began 25 years or more ago, not even our scientist colleagues were very interested. At the first seminar I presented about our attempts to identify the biases in Australian weather data, one colleague told me I was wasting my time. He reckoned that the raw weather data were sufficiently accurate for any possible use people might make of them.
One wonders how this colleague knew this without studying it.

In theory it is nice that some people find homogenization so important as to do another review. It would be better if those people were scientifically interested. The launch party of the Policy Foundation suggests that they are interested in spin, not science. The Policy Foundation review team will have to do a lot of work to recover from this launch party. I would have resigned.

[UPDATE 2019: The GWPF seems to have stopped paying for their PR page about their "review", https://www.tempdatareview.org/. It now hosts Chinese advertisements for pills. I am not aware of anything coming out of the "review", no report, no summary of the submitted comments written by volunteers in their free time for the GWPF, no article. If you thought this was a PR move to attack science from the start, you may have had a point.]


Related reading

Just the facts, homogenization adjustments reduce global warming

HotWhopper must have a liberal billionaire and a science team behind her. A great, detailed post: Denier Weirdness: A mock delegation from the Heartland Institute and a fake enquiry from the GWPF

William M. Connolley gives his candid take at Stoat: Two new reviews of the homogenization methods used to remove non-climatic changes

Nick Stokes: GWPF inquiring into temperature adjustments

And Then There's physics: How many times do we have to do this?

The Independent: Leading group of climate change deniers accused of creating 'fake controversy' over claims global temperature data may be inaccurate

Phil Plait at Bad Astronomy comment on the Telegraph piece: No, Adjusting Temperature Measurements Is Not a Scandal

John Timmer at Ars Technica is also fed up with being served the same story about some upward adjusted stations every year: Temperature data is not “the biggest scientific scandal ever” Do we have to go through this every year?

The astronomer behind the blog "And Then There's Physics" explains why the removal of non-climatic effects makes sense. In the comments he talks about adjustments made to astronomical data. Probably every numerical observational discipline of science performs data processing to improve the accuracy of its analysis.

Steven Mosher, a climate "sceptic" who has studied the temperature record in detail and is no longer sceptical about it, reminds us of all the adjustments demanded by the "sceptics".

Nick Stokes, an Australian scientist, has a beautiful post that explains the small adjustments to the land surface temperature in more detail.

Statistical homogenisation for dummies

A short introduction to the time of observation bias and its correction

New article: Benchmarking homogenisation algorithms for monthly data

Bob Ward at the Guardian: Scepticism over rising temperatures? Lord Lawson peddles a fake controversy

Wednesday, October 8, 2014

A framework for benchmarking of homogenisation algorithm performance on the global scale - Paper now published

By Kate Willett, reposted from the Surface Temperatures blog of the International Surface Temperature Initiative (ISTI).

The ISTI benchmarking working group have just had their first benchmarking paper accepted at Geoscientific Instrumentation, Methods and Data Systems:

Willett, K., Williams, C., Jolliffe, I. T., Lund, R., Alexander, L. V., Brönnimann, S., Vincent, L. A., Easterbrook, S., Venema, V. K. C., Berry, D., Warren, R. E., Lopardo, G., Auchmann, R., Aguilar, E., Menne, M. J., Gallagher, C., Hausfather, Z., Thorarinsdottir, T., and Thorne, P. W.: A framework for benchmarking of homogenisation algorithm performance on the global scale, Geosci. Instrum. Method. Data Syst., 3, 187-200, doi:10.5194/gi-3-187-2014, 2014.

Benchmarking, in this context, is the assessment of homogenisation algorithm performance against a set of realistic synthetic worlds of station data where the locations and size/shape of inhomogeneities are known a priori. Crucially, these inhomogeneities are not known to those performing the homogenisation, only to those performing the assessment. Assessment of both the ability of algorithms to find changepoints and to accurately return the synthetic data to its clean form (prior to the addition of inhomogeneity) has three main purposes:

1) quantification of uncertainty remaining in the data due to inhomogeneity
2) inter-comparison of climate data products in terms of fitness for a specified purpose
3) providing a tool for further improvement in homogenisation algorithms

Here we describe what we believe would be a good approach to a comprehensive homogenisation algorithm benchmarking system. This includes an overarching cycle of: benchmark development; release of formal benchmarks; assessment of homogenised benchmarks; and an overview of where we can improve for next time around (Figure 1).

Figure 1. Overview of the ISTI comprehensive benchmarking system for assessing the performance of homogenisation algorithms. (Fig. 3 of Willett et al., 2014)

There are four components to creating this benchmarking system.

Creation of realistic clean synthetic station data
Firstly, we must be able to synthetically recreate the 30000+ ISTI stations such that they have the same variability, auto-correlation and interstation cross-correlations as the real data, but are free from systematic error. In other words, they must contain a realistic seasonal cycle and features of natural variability (e.g., ENSO, volcanic eruptions etc.). There must be realistic month-to-month persistence in each station and geographically across nearby stations.
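
As a toy illustration of these requirements (this is not the ISTI generator; all parameter values are invented), one can combine a shared seasonal cycle, month-to-month persistence via an AR(1) term and cross-correlation via a common regional signal:

```python
import numpy as np

rng = np.random.default_rng(0)

n_stations, n_years = 5, 100
n_months = 12 * n_years
months = np.arange(n_months)

# Shared seasonal cycle (amplitude is an arbitrary illustration).
seasonal = 10.0 * np.sin(2.0 * np.pi * months / 12.0)

# Cross-correlated innovations: a common regional signal plus station-specific noise.
regional = rng.normal(0.0, 1.0, n_months)
local = rng.normal(0.0, 0.5, (n_stations, n_months))
innovations = regional + local

# AR(1) persistence applied to each station's anomaly series.
phi = 0.6
anomalies = np.zeros_like(innovations)
for t in range(1, n_months):
    anomalies[:, t] = phi * anomalies[:, t - 1] + innovations[:, t]

clean = seasonal + anomalies   # clean synthetic "station" series, shape (n_stations, n_months)
```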

Creation of realistic error models to add to the clean station data
The added inhomogeneities should cover all known types of inhomogeneity in terms of their frequency, magnitude and seasonal behaviour. For example, inhomogeneities could be any or a combination of the following (a toy sketch of adding such errors to a clean series follows the list):

- geographically or temporally clustered due to events which affect entire networks or regions (e.g. change in observation time);
- close to end points of time series;
- gradual or sudden;
- variance-altering;
- combined with the presence of a long-term background trend;
- small or large;
- frequent;
- seasonally or diurnally varying.
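
As promised above, here is a minimal sketch of such an error model (my own illustration, not the ISTI error models; break frequencies and magnitudes are arbitrary): sudden jumps plus an optional gradual drift, with the true break positions and sizes returned so that the assessment can later be done against them.

```python
import numpy as np

rng = np.random.default_rng(1)

def add_inhomogeneities(series, n_breaks=2, drift_per_month=None):
    """Toy error model: sudden level shifts and optionally a gradual drift.

    Real error models also cluster breaks in time and space, let them vary
    with season and combine several of the behaviours listed above.
    """
    corrupted = series.copy()
    n = series.size
    positions = np.sort(rng.integers(12, n - 12, n_breaks))   # keep breaks away from the ends
    sizes = rng.normal(0.0, 0.8, n_breaks)                    # assumed break magnitudes
    for pos, size in zip(positions, sizes):
        corrupted[pos:] += size                               # sudden jump
    if drift_per_month is not None:                           # e.g. a growing urban influence
        corrupted += drift_per_month * np.arange(n)
    return corrupted, positions, sizes

# Example, corrupting one clean series from the previous sketch:
# corrupted, true_positions, true_sizes = add_inhomogeneities(clean[0])
```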

Design of an assessment system
Assessment of the homogenised benchmarks should be designed with the three purposes of benchmarking in mind. Both the ability to correctly locate changepoints and the ability to adjust the data back to its homogeneous state are important. The assessment can be split into four different levels (a toy scoring sketch for Level 2 follows the list):

- Level 1: The ability of the algorithm to restore an inhomogeneous world to its clean world state in terms of climatology, variance and trends.

- Level 2: The ability of the algorithm to accurately locate changepoints and estimate their size/shape.

- Level 3: The strengths and weaknesses of an algorithm against specific types of inhomogeneity and observing system issues.

- Level 4: A comparison of the benchmarks with the real world in terms of detected inhomogeneity both to measure algorithm performance in the real world and to enable future improvement to the benchmarks.
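
For Level 2, a simple way to score detections is to count hits within some tolerance of the known break positions and treat the rest as false alarms. The sketch below is my own illustration of that idea, not the ISTI assessment code; the tolerance and the example positions are made up.

```python
def score_changepoints(detected, truth, tolerance=12):
    """Toy Level-2-style scoring: a detected break is a hit if it falls within
    `tolerance` months of a not-yet-matched true break; the rest are false
    alarms, and unmatched true breaks are misses."""
    remaining = list(truth)
    hits = 0
    for d in sorted(detected):
        candidates = [t for t in remaining if abs(d - t) <= tolerance]
        if candidates:
            hits += 1
            remaining.remove(min(candidates, key=lambda t: abs(d - t)))
    false_alarms = len(detected) - hits
    misses = len(remaining)
    return hits, false_alarms, misses

# Made-up break positions, in months since the start of the series:
print(score_changepoints(detected=[118, 410, 705], truth=[120, 700]))  # -> (2, 1, 0)
```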

The benchmark cycle
This should all take place within a well laid out framework to encourage people to take part and make the results as useful as possible. Timing is important. Too long a cycle will mean that the benchmarks become outdated. Too short a cycle will reduce the number of groups able to participate.

Producing the clean synthetic station data on the global scale is a complicated task that has now taken several years, but we are close to completing a version 1. We have collected a list of known region-wide inhomogeneities and built up a comprehensive understanding of the many different types of inhomogeneities that can affect station data. We have also considered a number of assessment options and decided to focus on levels 1 and 2 for assessment within the benchmark cycle. Our benchmarking working group is aiming to release the first benchmarks by January 2015.

Tuesday, November 26, 2013

Are break inhomogeneities a random walk or a noise?

Tomorrow is the next conference call of the benchmarking and assessment working group (BAWG) of the International Surface Temperature Initiative (ISTI; Thorne et al., 2011). The BAWG will create a dataset to benchmark (validate) homogenization algorithms. It will mimic the real mean temperature data of the ISTI, but will include known inhomogeneities, so that we can assess how well the homogenization algorithms remove them. We are almost finished discussing how the benchmark dataset should be developed, but still need to fix some details, such as the question: are break inhomogeneities a random walk or a noise?
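
To see what is at stake in this question, here is a toy illustration (my own sketch with arbitrary numbers, not the BAWG design): if every break shifts the station level relative to the previous segment, the offsets accumulate like a random walk and their spread grows with the number of breaks; if instead each segment's level is an independent draw around the true climate, the spread stays constant no matter how many breaks there are.

```python
import numpy as np

rng = np.random.default_rng(2)

n_breaks, n_series, sigma = 10, 10000, 0.5
steps = rng.normal(0.0, sigma, (n_series, n_breaks))   # simulated break perturbations

# Random-walk model: each break shifts the level relative to the previous segment,
# so the offset of the final segment is the cumulative sum of all steps.
offset_random_walk = steps.cumsum(axis=1)[:, -1]

# Noise model: each segment's level is an independent draw around the true climate,
# so the offset of the final segment is just the last draw.
offset_noise = steps[:, -1]

print("std of final offset, random walk:", offset_random_walk.std())  # ~ sigma * sqrt(n_breaks)
print("std of final offset, noise model:", offset_noise.std())        # ~ sigma
```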

Previous studies

The benchmark dataset of the ISTI will be global and is also intended to be used to estimate uncertainties in the climate signal due to remaining inhomogeneities. These are the two main improvements over previous validation studies.

Williams, Menne, and Thorne (2012) validated the pairwise homogenization algorithm of NOAA on a dataset mimicking the US Historical Climate Network. The paper focusses on how well large-scale biases can be removed.

The COST Action HOME has performed a benchmarking of several small networks (5 to 19 stations) realistically mimicking European climate networks (Venema et al., 2012). Its main aim was to intercompare homogenization algorithms; the small networks also allowed HOME to test manual homogenization methods.

These two studies were blind, in other words the scientists homogenizing the data did not know where the inhomogeneities were. An interesting coincidence is that the people who generated the blind benchmarking data were outsiders at the time: Peter Thorne for NOAA and me for HOME. This probably explains why we both made an error, which we should not repeat in the ISTI.

Friday, March 29, 2013

Special issue on homogenisation of climate series

The open access Quarterly Journal of the Hungarian Meteorological Service "Időjárás" has just published a special issue on homogenization of climate records. This special issue contains eight research papers. It is an offspring of the COST Action HOME: Advances in homogenization methods of climate series: an integrated approach (COST-ES0601).

To be able to discuss eight papers, this post does not contain as much background information as usual and is aimed at people already knowledgeable about homogenization of climate networks.

Contents

Mónika Lakatos and Tamás Szentimrey: Editorial.
The editorial explains the background of this special issue: the importance of homogenisation and the COST Action HOME. Mónika and Tamás, thank you very much for your efforts to organise this special issue. I think every reader will agree that it has become a valuable journal issue.

Monthly data

Ralf Lindau and Victor Venema: On the multiple breakpoint problem and the number of significant breaks in homogenization of climate records.
My article with Ralf Lindau is already discussed in a previous post on the multiple breakpoint problem.
José A. Guijarro: Climatological series shift test comparison on running windows.
Longer time series typically contain more than one inhomogeneity, but statistical tests are mostly designed to detect one break. One way to resolve this conflict is to apply these tests on short moving windows. José compares six statistical detection methods (t-test, Standard Normal Homogeneity Test (SNHT), two-phase regression (TPR), Wilcoxon-Mann-Whitney test, Durbin-Watson test and SRMD: squared relative mean difference), which are applied on running windows with a length between 1 and 5 years (12 to 60 values (months) on either side of the potential break). The smart trick of the article is that all methods are calibrated to a false alarm rate of 1% for better comparison; a toy sketch of this calibration idea follows the contents list. In this way, he can show that the t-test, SNHT and SRMD are best for this problem and almost identical. To get good detection rates, the window needs to be at least 2 × 3 years, i.e. 3 years on either side. As this harbours the risk of having two breaks in one window, José has decided to change his homogenization method CLIMATOL to use the semi-hierarchical scheme of SNHT instead of windows. The methods are tested on data with just one break; it would have been interesting to also simulate the more realistic case with multiple independent breaks.
Olivier Mestre, Peter Domonkos, Franck Picard, Ingeborg Auer, Stéphane Robin, Emilie Lebarbier, Reinhard Böhm, Enric Aguilar, Jose Guijarro, Gregor Vertachnik, Matija Klancar, Brigitte Dubuisson, and Petr Stepanek: HOMER: a homogenization software – methods and applications.
HOMER is a new homogenization method and is developed using the best methods tested on the HOME benchmark. Thus theoretically, this should be the best method currently available. Still, sometimes interactions between parts of an algorithm can lead to unexpected results. It would be great if someone would test HOMER on the HOME benchmark dataset, so that we can compare its performance with the other algorithms.
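
As mentioned under José Guijarro's paper above, calibrating every detection test to the same false alarm rate is what makes the comparison fair. Here is a minimal sketch of that calibration idea (my own illustration, not José's code; the window length, series length and the plain t-statistic are assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

def max_window_t_stat(x, half):
    """Largest absolute two-sample t-statistic comparing the `half` values
    before and after each candidate break position in a running window."""
    best = 0.0
    for k in range(half, x.size - half):
        a, b = x[k - half:k], x[k:k + half]
        s = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / half)
        best = max(best, abs(a.mean() - b.mean()) / s)
    return best

# Calibrate the critical value on homogeneous white noise so that only ~1% of
# break-free series trigger a detection; tests calibrated this way can then be
# compared fairly on their detection rates.
half = 36                      # 3 years of monthly values on either side (assumed)
n_months, n_sim = 600, 200     # more simulations give a more precise threshold
null_stats = [max_window_t_stat(rng.normal(0.0, 1.0, n_months), half) for _ in range(n_sim)]
critical_value = float(np.quantile(null_stats, 0.99))
print("critical value for a ~1% false alarm rate:", round(critical_value, 2))
```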

Thursday, October 4, 2012

Beta version of a new global temperature database released

Today, a first version of the global temperature dataset of the International Surface Temperature Initiative (ISTI) with 39 thousand stations has been released. The aim of the initiative is to provide an open and transparent temperature dataset for climate research.

The database is designed as a climate "sceptic" wet dream: the entire processing of the data will be performed with automatic open software. This includes every processing step from conversion to standard units, to merging stations to longer series, to quality control, homogenisation, gridding and computation of regional and global means. There will thus be no opportunity for evil climate scientists to fudge the data and create an artificially strong temperature trend.

It is planned that in many cases you will be able to go back to the digital images of the books or cards on which the observer noted down the temperature measurements. This will not be possible for all data. Many records have been keyed directly in the past, without making digital images. Sometimes the original data is lost, for instance in the case of Austria, where the original daily observations were lost in the Second World War and only the monthly means are still available from annual reports.

The ISTI also has a group devoted to data rescue, to encourage people to go into the archives, image and key in the observations, and upload this information to the database.


Tuesday, January 10, 2012

New article: Benchmarking homogenisation algorithms for monthly data

The main paper of the COST Action HOME on homogenisation of climate data has been published today in Climate of the Past. This post briefly describes the problem of inhomogeneities in climate data and how such data problems are corrected by homogenisation. The main part explains the topic of the paper, a new blind validation study of homogenisation algorithms for monthly temperature and precipitation data. All of the most-used and best algorithms participated.

Inhomogeneities

To study climatic variability the original observations are indispensable, but they are not directly usable. Next to real climate signals they may also contain non-climatic changes. Corrections to the data are needed to remove these non-climatic influences; this is called homogenisation. The best-known non-climatic change is the urban heat island effect. The temperature in cities can be warmer than in the surrounding countryside, especially at night. Thus as cities grow, one may expect that temperatures measured in cities become higher. On the other hand, many stations have been relocated from cities to nearby, typically cooler, airports.

Other non-climatic changes can be caused by changes in measurement methods. Meteorological instruments are typically installed in a screen to protect them from direct sun and wetting. In the 19th century it was common to use a metal screen on a north-facing wall. However, the building may warm the screen, leading to higher temperature measurements. When this problem was realised, the so-called Stevenson screen was introduced, typically installed in gardens, away from buildings. This is still the most common weather screen, with its characteristic double-louvre door and walls. Nowadays automatic weather stations, which reduce labor costs, are becoming more common; they protect the thermometer with a number of white plastic cones. This necessitated changes from manually recorded liquid-in-glass thermometers to automated electrical resistance thermometers, which reduces the recorded temperature values.



One way to study the influence of changes in measurement techniques is by making simultaneous measurements with historical and current instruments, procedures or screens. This picture shows three meteorological shelters next to each other in Murcia (Spain). The rightmost shelter is a replica of the Montsouri screen, in use in Spain and many European countries in the late 19th century and early 20th century. In the middle, Stevenson screen equipped with automatic sensors. Leftmost, Stevenson screen equipped with conventional meteorological instruments.
Picture: Project SCREEN, Center for Climate Change, Universitat Rovira i Virgili, Spain.


A further example of a change in the measurement method: the precipitation amounts observed in the early instrumental period (roughly before 1900) are biased, about 10% lower than nowadays, because the measurements were often made on a roof. At the time, instruments were installed on rooftops to ensure that the instrument was never shielded from the rain, but it was later found that, due to the turbulent flow of the wind on roofs, some rain droplets and especially snowflakes did not fall into the opening. Consequently, measurements are nowadays performed closer to the ground.

Sunday, January 8, 2012

What distinguishes a benchmark?

Benchmarking is a community effort

Science has many terms for studying the validity or performance of scientific methods: testing, validation, intercomparison, verification, evaluation, and benchmarking. Every term has a different, sometimes subtly different, meaning. Initially I had wanted to compare all these terms with each other, but that would have become a very long post, especially as the meaning of every term differs between business, engineering, computation and science. Therefore, this post will only propose a definition for benchmarking in science and what distinguishes it from other approaches, casually called "other validation studies" from now on.

In my view, benchmarking has three distinguishing features:
1. The methods are tested blind.
2. The problem is realistic.
3. Benchmarking is a community effort.
The term benchmark has become fashionable lately. It is also used, however, for validation studies that do not display these three features. This is not wrong, as there is no generally accepted definition of benchmarking. In fact, an important article on benchmarking by Sim et al. (2003) defines "a benchmark as a test or set of tests used to compare the performance of alternative tools or techniques", which would include any validation study. They then limit the topic of their article, however, to interesting benchmarks, which are "created and used by a technical research community." If benchmarking is used for any type of validation study, there is no added value to the word. Thus I hope this post can be a starting point for a generally accepted and more restrictive definition.