Friday, 29 May 2020

What does statistical homogenization tell us about the underestimated global warming over land?

Climate station data contains inhomogeneities, which are detected and corrected by comparing a candidate station to its neighbouring reference stations. The most important inhomogeneities are the ones that lead to errors in the station network-wide trends and in global trend estimates. 
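As a minimal sketch of this comparison, assume a candidate station and five neighbours that share one regional climate signal and differ only by local weather noise. The station counts, noise levels, the 0.5 °C jump and the helper names `difference_series` and `station` are illustrative assumptions, not values or code from any of the datasets discussed here. Operational homogenization methods are considerably more elaborate.

```python
import random
from statistics import mean

def difference_series(candidate, neighbours):
    """Candidate minus the mean of its neighbours, value by value."""
    reference = [mean(values) for values in zip(*neighbours)]
    return [c - r for c, r in zip(candidate, reference)]

random.seed(1)
n_years, break_year, jump = 60, 30, 0.5  # all illustrative assumptions

# Regional climate signal shared by all stations (here a simple random walk).
regional = [0.0]
for _ in range(n_years - 1):
    regional.append(regional[-1] + random.gauss(0.0, 0.3))

def station(shift_from=None):
    """Regional signal plus local weather noise, optionally with a jump."""
    series = [r + random.gauss(0.0, 0.1) for r in regional]
    if shift_from is not None:
        series = [v + jump if i >= shift_from else v
                  for i, v in enumerate(series)]
    return series

candidate = station(shift_from=break_year)
neighbours = [station() for _ in range(5)]
diff = difference_series(candidate, neighbours)

# The shared regional signal cancels, so the jump stands out in the difference.
before, after = mean(diff[:break_year]), mean(diff[break_year:])
print(f"mean difference before: {before:+.2f}, after: {after:+.2f}")
```

In the candidate series itself the jump is hidden in the regional variability; in the difference series only the much smaller local noise remains.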

An earlier post in this series argued that statistical homogenization will tend to under-correct errors in the network-wide trends in the raw data. Simply put, some of the trend error will remain. The catalyst for this series is the new finding that when the signal to noise ratio is too low, homogenization methods will have large errors in the positions of the jumps/breaks. For much of the earlier data and for networks in poorer countries this probably means that any trend errors will be seriously under-corrected, if they are corrected at all.

The questions for this post are: 1) What do the corrections in global temperature datasets do to the global trend and 2) What can we learn from these adjustments for global warming estimates?

The global warming trend estimate

In the global temperature station datasets statistical homogenization leads to larger warming estimates. So as we tend to underestimate how much correction is needed, this suggests that the Earth warmed up more than current estimates indicate.

Below is the warming estimate in NOAA’s Global Historical Climatology Network (versions 3 and 4) from Menne et al. (2018). You see the warming in the “raw data” (before homogenization; striped lines) and in the homogenized data (drawn lines). The new version 4 is drawn in black, the previous version 3 in red. For both versions homogenization makes the estimated warming larger.

After homogenization the warming estimates of the two versions are quite similar. The difference is in the raw data. Version 4 is based on the raw data of the International Surface Temperature Initiative and has many more stations. Version 3 had many stations that report automatically; these are typically professional stations and a considerable part of them are at airports. One reason the raw data may show less warming in version 3 is that many stations at airports were in cities before. Taking them out of the urban heat island, and often also improving the local siting of the station, may have produced a systematic artificial cooling in the raw observations.

Version 4 has more stations and thus a higher signal to noise ratio. One may thus expect it to show more warming. That this is not the case is a first hint that the situation is not that simple, as explained at the end of this post.

Figure from Menne et al. with warming estimates from 1880. See caption below.
The global land warming estimates based on the Global Historical Climatology Network dataset of NOAA. The red lines are for version 3, the black lines for the new version 4. The striped lines are before homogenization and the drawn lines after homogenization. Figure from Menne et al. (2018).

The difference due to homogenization in the global warming estimates is shown in the figure below, also from Menne et al. (2018). The study also added an estimate for the data of the Berkeley Earth initiative.

(Background information. Berkeley Earth started as a US Culture War initiative where non-climatologists computed the observed global warming. Before the results were in, climate “sceptics” claimed their methods were the best and they would accept any outcome. The moment the results turned out to be scientifically correct, but not politically correct, the climate “sceptics” dropped them like a hot potato.)

We can read from the figure that over the full period homogenization increases warming estimates by about 0.3 °C per century in GHCNv3, while this is 0.2 °C in GHCNv4 and 0.1 °C in the Berkeley Earth dataset. GHCNv3 has more than 7000 stations (Lawrimore et al., 2011). GHCNv4 is based on the ISTI dataset (Thorne et al., 2011), which has about 32,000 stations, but GHCN only uses those with at least 10 years of data and thus contains about 26,000 stations (Menne et al., 2018). Berkeley Earth is based on 35,000 stations (Rohde et al., 2013).

Figure from Menne et al. (2018) showing how much adjustments were made.
The difference due to homogenization in the global warming estimates (Menne et al., 2018). The red line is for the smaller GHCNv3 dataset, the black line for GHCNv4 and the blue line for Berkeley Earth.

What does this mean for global warming estimates?

So, what can we learn from these adjustments for global warming estimates? At the moment, I am afraid, not yet a whole lot. However, the sign is quite likely right. If we could do a perfect homogenization, I expect that this would make the warming estimates larger. But estimating how large the correction should have been from the corrections that were actually made in the above datasets is difficult.

In the beginning, I was thinking: if the signal to noise ratio in some network is too low, we may be able to estimate that in such a case we under-correct, say, 50% and then make the adjustments unbiased by making them, say, twice as large.

However, doing this, especially globally, is a huge leap of faith.

The first assumption this would make is that the trend bias in data sparse regions and periods is the same as that of data rich regions and periods. However, the regions with high station density are in the mid-latitudes where atmospheric measurements are relatively easy. The data sparse periods are also the periods in which large changes in the instrumentation were made as we were still learning how to make good meteorological observations. So we cannot reliably extrapolate from data rich regions and periods to data sparse regions and periods.

Furthermore, there will not be one correction factor to account for under-correction because the signal to noise ratio is different everywhere. Maybe America is only under-corrected by 10% and needs just a little nudge to make the trend correction unbiased. However, homogenization adjustments in data sparse regions may only be able to correct such a small part of the trend bias that correcting for the under-correction becomes adventurous or will even make trend estimates more uncertain. So we would at least need to make such computations for many regions and periods.
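To illustrate why such a rescaling gets riskier in sparse networks, here is a small sketch with made-up numbers. It assumes that homogenization removes a known fraction of the trend bias and that the applied adjustment has the same uncertainty in every region. Both are strong simplifications, and the three regions and the function `rescale` are hypothetical.

```python
def rescale(adjustment, uncertainty, fraction_corrected):
    """Divide both the adjustment and its uncertainty by the corrected fraction."""
    if not 0.0 < fraction_corrected <= 1.0:
        raise ValueError("fraction_corrected must be in (0, 1]")
    return adjustment / fraction_corrected, uncertainty / fraction_corrected

# Hypothetical regions: (applied adjustment, its uncertainty, fraction corrected),
# adjustments and uncertainties in degrees Celsius per century.
regions = {
    "dense network": (0.18, 0.05, 0.9),
    "moderate network": (0.10, 0.05, 0.5),
    "sparse network": (0.02, 0.05, 0.1),
}

for name, (adj, unc, frac) in regions.items():
    full, full_unc = rescale(adj, unc, frac)
    print(f"{name:16s} estimate {full:.2f} +/- {full_unc:.2f} C per century")
```

All three made-up regions imply the same full correction, but the uncertainty of the rescaled estimate grows in proportion to 1 / fraction corrected. The sketch also ignores that the corrected fraction itself would only be known approximately, which would add further uncertainty.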

Finally, another reason not to take such an estimate too seriously is the spatial and temporal characteristics of the bias. The signal to noise ratio is not the only problem. One would expect that it also matters how the network-wide trend bias is distributed over the network. In case of relocations of city stations to airports, a small number of stations will have a large jump. Such a large jump is relatively easy to detect, especially as its neighbouring stations will mostly be unaffected.

Already a harder case is the time of observation bias in America, where a large part of the stations has experienced a cooling shift from afternoon to morning measurements over many decades. Here, in most cases the neighbouring stations were not affected around the same time, but the smaller shift makes it harder to detect these breaks.

(NOAA has a special correction for this problem, but when it is turned off statistical homogenization still finds the same network-wide trend. So for this kind of bias the network density in America is apparently sufficient.)

Among the hardest cases are changes in the instrumentation, for example, the introduction of Automatic Weather Stations in the last decades or the introduction of the Stevenson screen a century ago. These relatively small breaks often happen over a period of only a few decades, if not years, which means that also the neighbouring stations are affected. That makes it hard to detect them in a difference time series.
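The limiting case can be sketched with noise-free toy series. Assume, as an extreme simplification, that an instrument change hits the candidate and all its neighbours in exactly the same year with exactly the same jump; in reality the changes are spread over years to decades. The jump size and the helper names are illustrative assumptions.

```python
from statistics import mean

n_years, break_year, jump = 40, 20, 0.3  # illustrative assumptions

def step_series(shift_from=None):
    """Noise-free station anomaly: zero, plus a jump from shift_from onwards."""
    return [jump if shift_from is not None and i >= shift_from else 0.0
            for i in range(n_years)]

def jump_in_difference(candidate, neighbours):
    """Mean of (candidate - neighbour average) after the break minus before it."""
    reference = [mean(values) for values in zip(*neighbours)]
    diff = [c - r for c, r in zip(candidate, reference)]
    return mean(diff[break_year:]) - mean(diff[:break_year])

candidate = step_series(shift_from=break_year)

# Relocation-like case: only the candidate has the break.
isolated = jump_in_difference(candidate, [step_series() for _ in range(4)])

# Instrument-change-like case: all neighbours get the same break in the same year.
simultaneous = jump_in_difference(
    candidate, [step_series(shift_from=break_year) for _ in range(4)])

print(f"isolated break: {isolated:.2f}, simultaneous break: {simultaneous:.2f}")
```

The isolated break shows up in full in the difference series, while the simultaneous one cancels completely, although it changes the network-wide trend just as much per station.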

Studying from the data how the biases are distributed is hard. One could study this by homogenizing the data and studying the breaks, but the ones which are difficult to detect will then be under-represented. This is a tough problem; please leave suggestions in the comments.

Because of how the biases are distributed it is perfectly possible that the trend biases corrected in GHCN and Berkeley Earth are due to the easy-to-correct problems, such as the relocations to airports, while the hard ones, such as the transition to Stevenson screens, are hardly corrected. In this case, the corrections that could be made do not provide information on the ones that could not be made. They have different causes and different difficulties.

So if we had a network where the signal to noise ratio is around one, we could not say that the under-correction is, say, 50%. One would have to specify for which kind of distribution of the bias this is valid.

GHCNv3, GHCNv4 and Berkeley Earth

Coming back to the trend estimates of GHCN version 3 and version 4: one may have expected that version 4, having more stations, is better able to correct trend biases and should thus show a larger trend than version 3. This would be even more so for Berkeley Earth. But the final trend estimates are quite similar. Similarly, in the most data rich period, after the Second World War, the smallest corrections are made.

The datasets with the largest number of stations showing the strongest trend would have been a reasonable expectation if the trend estimates of the raw data had been similar. But these raw data trends are the reason for the differences in the size of the corrections, while the trend estimates based on the homogenized data are quite similar.

Many additional stations will be in regions and periods where we already had many stations and where the station density was no problem. On the other hand, adding some stations to data sparse regions may not be sufficient to fix the low signal to noise ratio. So the largest improvements would be expected for the moderate cases where the signal to noise ratio is around one. Until we have global estimates of the signal to noise ratio for these datasets, we do not know for which percentage of stations this is relevant, but this could be relatively small.

The arguments of the previous section are also applicable here; the relationship between station density and adjustments may not be that easy. It is especially suspicious that the corrections in the period after the Second World War are so small; we know quite a lot happened to the measurement networks. Maybe these effects all average out, but that would be quite a coincidence. Another possibility is that these changes in observational methods were made over relatively short periods to entire networks, making it hard to correct them.

A reason for the similar outcomes for the homogenized data could be that all datasets successfully correct for trend biases due to problems like the transition to airports, while for every dataset the signal to noise ratio is not enough to correct problems like the transition to Stevenson screens. GHCNv4 and Berkeley Earth, using as many stations as they could find, could well have more stations which are currently badly sited than GHCNv3, which was more selective. In that case the smaller effective corrections of these two datasets would be due to compensating errors.

Finally, a small disclaimer: The main change from version 3 to 4 was the number of stations, but there were other small changes, so it is not just a comparison of two datasets where only the signal to noise ratio is different. Such a pure comparison still needs to be made. The homogenization methods of GHCN and Berkeley Earth are even more different.

My apologies for all the maybe's and could be's, but this is something that is more complicated than it may look and I would not be surprised if it turns out to be impossible to estimate how much correction is needed based on the corrections that are made by homogenization algorithms. The only thing I am confident about is that homogenization improves trend estimates, but I am not confident about how much it improves them.

Parallel measurements

Another way to study these biases in the warming estimates is to go into the books and study station histories in 200 plus countries. This is basically how sea surface temperature records are homogenized. To do this for land stations is a much larger project due to the large number of countries and languages.

Still there are such experiments, which give a first estimate for some of the biases when it comes to the global mean temperature (do not expect regional detail). In the next post I will try to estimate the missing warming this way. We do not have much data from such experiments yet, but I expect that this will be the future.

Chimani, Barbara, Victor Venema, Annemarie Lexer, Konrad Andre, Ingeborg Auer and Johanna Nemec, 2018: Inter-comparison of methods to homogenize daily relative humidity. International Journal of Climatology, 38, pp. 3106–3122.

Gubler, Stefanie, Stefan Hunziker, Michael Begert, Mischa Croci-Maspoli, Thomas Konzelmann, Stefan Brönnimann, Cornelia Schwierz, Clara Oria and Gabriela Rosas, 2017: The influence of station density on climate data homogenization. International Journal of Climatology, 37, pp. 4670–4683.

Lawrimore, Jay H., Matthew J. Menne, Byron E. Gleason, Claude N. Williams, David B. Wuertz, Russell S. Vose and Jared Rennie, 2011: An overview of the Global Historical Climatology Network monthly mean temperature data set, version 3. Journal of Geophysical Research, 116, D19121.

Lindau, Ralf and Victor Venema, 2018: On the reduction of trend errors by the ANOVA joint correction scheme used in homogenization of climate station records. International Journal of Climatology, 38, pp. 5255–5271.

Rohde, Robert, Richard A. Muller, Robert Jacobsen, Elizabeth Muller, Saul Perlmutter, Arthur Rosenfeld, Jonathan Wurtele, Donald Groom and Charlotte Wickham, 2013: A New Estimate of the Average Earth Surface Land Temperature Spanning 1753 to 2011. Geoinformatics & Geostatistics: An Overview, 1, no.1.

Sutton, Rowan, Buwen Dong and Jonathan Gregory, 2007: Land/sea warming ratio in response to climate change: IPCC AR4 model results and comparison with observations. Geophysical Research Letters, 34, L02701.

Thorne, Peter W., Kate M. Willett, Rob J. Allan, Stephan Bojinski, John R. Christy, Nigel Fox, Simon Gilbert, Ian Jolliffe, John J. Kennedy, Elizabeth Kent, Albert Klein Tank, Jay Lawrimore, David E. Parker, Nick Rayner, Adrian Simmons, Lianchun Song, Peter A. Stott and Blair Trewin, 2011: Guiding the creation of a comprehensive surface temperature resource for twenty-first century climate science. Bulletin of the American Meteorological Society, 92, ES40–ES47.

Wallace, Craig and Manoj Joshi, 2018: Comparison of land–ocean warming ratios in updated observed records and CMIP5 climate models. Environmental Research Letters, 13, no. 114011. 

Williams, Claude, Matthew Menne and Peter Thorne, 2012: Benchmarking the performance of pairwise homogenization of surface temperatures in the United States. Journal of Geophysical Research, 117, D05116.
