Showing posts with label joint correction. Show all posts
Showing posts with label joint correction. Show all posts

Monday, October 12, 2020

The deleted chapter of the WMO Guidance on the homogenisation of climate station data

The Task Team on Homogenization (TT-HOM) of the Open Panel of CCl Experts on Climate Monitoring and Assessment (OPACE-2) of the Commission on Climatology (CCl) of the World Meteorological Organization (WMO) has published their Guidance on the homogenisation of climate station data.

The guidance report was a bit longish, so at the end we decided that the last chapter on "Future research & collaboration needs" was best deleted. As chair of the task team and as someone who likes tp dream about what others could do in a comfy chair, I wrote most of this chapter and thus we decided to simply make it a blog post for this blog. Enjoy.

Introduction

This guidance is based on our current best understanding of inhomogeneities and homogenisation. However, writing it also makes clear there is a need for a better understanding of the problems.

A better mathematical understanding of statistical homogenisation is important because that is what most of our work is based on. A stronger mathematical basis is a prerequisite for future methodological improvements.

A stronger focus on a (physical) understanding of inhomogeneities would complement and strengthen the statistical work. This kind of work is often performed at the station or network level, but also needed at larger spatial scales. Much of this work is performed using parallel measurements, but they are typically not internationally shared.

In an observational science the strength of the outcomes depends on a consilience of evidence. Thus having evidence on inhomogeneities from both statistical homogenisation and physical studies strengthens the science.

This chapter will discuss the needs for future research on homogenisation grouped in five kinds of problems. In the first section we will discuss research on improving our physical understanding and physics-based corrections. The next section is about break detection, especially about two fundamental problems in statistical homogenisation: the inhomogeneous-reference problem and the multiple-breakpoint problem.

Next write about computing uncertainties in trends and long-term variability estimates from homogenised data due to remaining inhomogeneities. It may be possible to improve correction methods by treating it as a statistical model selection problem. The last section discusses whether inhomogeneities are stochastic or deterministic and how that may affect homogenisation and especially correction methods for the variability around the long-term mean.

For all the research ideas mentioned below, it is understood that in future we should study more meteorological variables than temperature. In addition, more studies on inhomogeneities across variables could be helpful to understand the causes of inhomogeneities and increase the signal to noise ratio. Homogenisation by national offices has advantages because here all climate elements from one station are stored together. This helps in understanding and identifying breaks. It would help homogenisation science and climate analysis to have a global database for all climate elements, like iCOADS for marine data. A Copernicus project has started working on this for land station data, which is an encouraging development.

Physical understanding

It is a good scientific practice to perform parallel measurements in order to manage unavoidable changes and to compare the results of statistical homogenisation to the expectations given the cause of the inhomogeneity according to the metadata. This information should also be analysed on continental and global scales to get a better understanding of when historical transitions took place and to guide homogenisation of large-scale (global) datasets. This requires more international sharing of parallel data and standards on the reporting of the size of breaks confirmed by metadata.

The Dutch weather service KNMI published a protocol how to manage possible future changes of the network, who decides what needs to be done in which situation, what kind of studies should be made, where the studies should be published and that the parallel data should be stored in their central database as experimental data. A translation of this report will soon be published by the WMO (Brandsma et al., 2019) and will hopefully inspire other weather services to formalise their network change management.

Next to statistical homogenisation, making and studying parallel measurements, and other physical estimates, can provide a second line of evidence on the magnitude of inhomogeneities. Having multiple lines of evidence provides robustness to observational sciences. Parallel data is especially important for the large historical transitions that are most likely to produce biases in network-wide to global climate datasets. It can validate the results of statistical homogenisation and be used to estimate possibly needed additional adjustments. The Parallel Observations Science Team of the International Surface Temperature Initiative (ISTI-POST) is working on building such a global dataset with parallel measurements.

Parallel data is especially suited to improve our physical understand of the causes of inhomogeneities by studying how the magnitude of the inhomogeneity depends on the weather and on instrumental design characteristics. This understanding is important for more accurate corrections of the distribution, for realistic benchmarking datasets to test our homogenisation methods and to determine which additional parallel experiments are especially useful.

Detailed physical models of the measurement, for example, the flow through the screens, radiative transfer and heat flows, can also help gain a better understanding of the measurement and its error sources. This aids in understanding historical instruments and in designing better future instruments. Physical models will also be paramount for understanding the impact of the surrounding on the measurement — nearby obstacles and surfaces influencing error sources and air flow — to changes in the measurand, such as urbanisation/deforestation or the introduction of irrigation. Land-use changes, especially urbanisation, should be studied together with relocations they may provoke.

Break detection

Longer climate series typically contain more than one break. This so-called multiple-breakpoint problem is currently an important research topic. A complication of relative homogenisation is that also the reference stations can have inhomogeneities. This so-called inhomogeneous-reference problem is not optimally solved yet. It is also not clear what temporal resolution is best for detection and what the optimal way is to handle the seasonal cycle in the statistical properties of climate data and of many inhomogeneities.

For temperature time series about one break per 15 to 20 years is typical and multiple breaks are thus common. Unfortunately, most statistical detection methods have been developed for one break and for the null hypothesis of white (sometimes red) noise. In case of multiple breaks the statistical test should not only take the noise variance into account, but also the break variance from breaks at other positions. For low signal to noise ratios, the additional break variance can lead to spurious detections and inaccuracies in the break position (Lindau and Venema, 2018a).

To apply single-breakpoint tests on series with multiple breaks, one ad-hoc solution is to first split the series at the most significant break (for example, the standard normalised homogeneity test, SNHT) and investigate the subseries. Such a greedy algorithm does not always find the optimal solution. Another solution is to detect breaks on short windows. The window should be short enough to contain only one break, which reduces power of detection considerably. This method is not used much nowadays.

Multiple breakpoint methods can find an optimal solution and are nowadays numerically feasible. This can be done in a hypothesis testing (MASH) or in a statistical model selection framework. For a certain number of breaks these methods find the break combination that minimize the internal variance, that is variance of the homogeneous subperiods, (or you could also state that the break combination maximizes the variance of the breaks). To find the optimal number of breaks, a penalty is added that increases with the number of breaks. Examples of such methods are PRODIGE (Caussinus & Mestre, 2004) or ACMANT (based on PRODIGE; Domonkos, 2011b). In a similar line of research Lu et al. (2010) solved the multiple breakpoint problem using a minimum description length (MDL) based information criterion as penalty function.

This penalty function of PRODIGE was found to be suboptimal (Lindau and Venema, 2013). It was found that the penalty should be a function of the number of breaks, not fixed per break and that the relation with the length of the series should be reversed. It is not clear yet how sensitive homogenisation methods respond to this, but increasing the penalty per break in case of low SNR to reduce the number of breaks does not make the estimated break signal more accurate (Lindau and Venema, 2018a).

Not only the candidate station, also the reference stations will have inhomogeneities, which complicates homogenisation. Such inhomogeneities can be climatologically especially important when they are due to network-wide technological transitions. An example of such a transition is the current replacement of temperature observations using Stevenson screens by automatic weather stations. Such transitions are important periods as they may cause biases in the network and global average trends and they produce many breaks over a short period.

A related problem is that sometimes all stations in a network have a break at the same date, for example, when a weather service changes the time of observation. Nationally such breaks are corrected using metadata. If this change is unknown in global datasets one can still detect and correct such inhomogeneities statistically by comparison with other nearby networks. That would require an algorithm that additionally knows which stations belong to which network and prioritizes correcting breaks found between stations in different networks. Such algorithms do not exist yet and information on which station belongs to which network for which period is typically not internationally shared.

The influence of inhomogeneities in the reference can be reduced by computing composite references over many stations, removing reference stations with breaks and by performing homogenisation iteratively.

A direct approach to solving this problem would be to simultaneously homogenise multiple stations, also called joint detection. A step in this direction are pairwise homogenisation methods where breaks are detected in the pairs. This requires an additional attribution step, which attributes the breaks to a specific station. Currently this is done by hand (for PRODIGE; Caussinus and Mestre, 2004; Rustemeier et al., 2017) or with ad-hoc rules (by the Pairwise homogenisation algorithm of NOAA; Menne and Williams, 2009).

In the homogenisation method HOMER (Mestre et al., 2013) a first attempt is made to homogenise all pairs simultaneously using a joint detection method from bio-statistics. Feedback from first users suggests that this method should not be used automatically. It should be studied how good this methods works and where the problems come from.

Multiple breakpoint methods are more accurate as single breakpoint methods. This expected higher accuracy is founded on theory (Hawkins, 1972). In addition, in the HOME benchmarking study it was numerically found that modern homogenisation methods, which take the multiple breakpoint and the inhomogeneous reference problems into account, are about a factor two more accurate as traditional methods (Venema et al., 2012).

However, the current version of CLIMATOL applies single-breakpoint detection tests, first SNHT detection on a window then splitting, to achieve results comparable to modern multiple-breakpoint methods with respect to break detection and homogeneity of the data (Killick, 2016). This suggests that the multiple-breakpoint detection principle may not be as important as previously thought and warrants deeper study or the accuracy of CLIMATOL is partly due to an unknown unknown.

The signal to noise ratio is paramount for the reliable detection of breaks. It would thus be valuable to develop statistical methods that explain part of the variance of a difference time series and remove this to see breaks more clearly. Data from (regional) reanalysis could be useful predictors for this.

First methods have been published to detect breaks for daily data (Toreti et al., 2012; Rienzner and Gandolfi, 2013). It has not been studied yet what the optimal resolution for breaks detection is (daily, monthly, annual), nor what the optimal way is to handle the seasonal cycle in the climate data and exploit the seasonal cycle of inhomogeneities. In the daily temperature benchmarking study of Killick (2016) most non-specialised detection methods performed better than the daily detection method MAC-D (Rienzner and Gandolfi, 2013).

The selection of appropriate reference stations is a necessary step for accurate detection and correction. Many different methods and metrics are used for the station selection, but studies on the optimal method are missing. The knowledge of local climatologists which stations have a similar regional climate needs to be made objective so that it can be applied automatically (at larger scales).

For detection a high signal to noise ratio is most important, while for correction it is paramount that all stations are in the same climatic region. Typically the same networks are used for both detection and correction, but it should be investigated whether a smaller network for correction would be beneficial. Also in general, we need more research on understanding the performance of (monthly and daily) correction methods.

Computing uncertainties

  • Also after homogenisation uncertainties remain in the data due to various problems: Not all breaks in the candidate station have been and can be detected.

  • False alarms are an unavoidable trade-off for detecting many real breaks.

  • Uncertainty in the estimation of correction parameters due to limited data.

  • Uncertainties in the corrections due to limited information on the break positions.

From validation and benchmarking studies we have a reasonable idea about the remaining uncertainties that one can expect in the homogenised data, at least with respect to changes in the long-term mean temperature. For many other variables and changes in the distribution of (sub-)daily temperature data individual developers have validated their methods, but systematic validation and comparison studies are still missing.

Furthermore, such studies only provide a general uncertainty level, whereas more detailed information for every single station/region and period would be valuable. The uncertainties will strongly depend on the signal to noise ratios, on the statistical properties of the inhomogeneities of the raw data and on the quality and cross-correlations of the reference stations. All of which vary strongly per station, region and period.

Communicating such a complicated errors structure, which is mainly temporal, but also partially spatial, is a problem in itself. Furthermore, not only the uncertainty in the means should be considered, but, especially for daily data, uncertainties in the complete probability density function need to be estimated and communicated. This could be communicated with an ensemble of possible realisations, similar to Brohan et al. (2006).

An analytic understanding of the uncertainties is important, but is often limited to idealised cases. Thus also numerical validation studies, such as the past HOME and upcoming ISTI studies are important for an assessment of homogenisation algorithms under realistic conditions.

Creating validation datasets also help to see the limits of our understanding of the statistical properties of the break signal. This is especially the case for variables other than temperature and for daily and (sub-)daily data. Information is needed on the real break frequencies and size distributions, but also their auto-correlations and cross-correlations, as well as explained in the next section the stochastic nature of breaks in the variability around the mean.

Validation studies focussed on difficult cases would be valuable for a better understanding. For example, sparse networks, isolated island networks, large spatial trend gradients and strong decadal variability in the difference series of nearby stations (for example, due to El Nino in complex mountainous regions).

The advantage of simulated data is that it can create a large number of quite realistic complete networks. For daily data it will remain hard for the years to come to determine how to generate a realistic validation dataset. Thus even if using parallel measurements is mostly limited to one break per test, it does provide the highest degree of realism for this one break.

Deterministic or stochastic corrections?

Annual and monthly data is normally used to study trends and variability in the mean state of the atmosphere. Consequently, typically only the mean is adjusted by homogenisation. Daily data, on the other hand is used to study climatic changes in weather variability, severe weather and extremes. Consequently, not only the mean should be corrected, but the full probability distribution describing the variability of the weather.

The physics of the problem suggests that many inhomogeneities are caused by stochastic processes. An example affecting many instruments are differences in the response time of instruments, which can lead to differences determined by turbulence. A fast thermometer will on average read higher maximum temperatures than a slow one, but this difference will be variable and sometimes be much higher than the average. In case of errors due to insolation the radiation error will be modulated by clouds. An insufficiently shielded thermometer will need larger corrections on warm days, which will typically be more sunny, but some warm days will be cloudy and not need much correction, while other warm days are sunny and calm and have a dry hot surface. The adjustment of daily data for studies on changes in the variability is thus a distribution problem and not only a regression bias-correction problem. For data assimilation (numerical weather prediction) accurate bias correction (with regression methods) is probably the main concern.

Seen as a variability problem, the correction of daily data is similar to statistical downscaling in many ways. Both methodologies aim to produce bias-corrected data with the right variability, taking into account the local climate and large-scale circulation. One lesson from statistical downscaling is that increasing the variance of a time series deterministically by multiplication with a fraction, called inflation, is the wrong approach and that the variance that could not be explained by regression using predictors should be added stochastically as noise instead (Von Storch, 1999). Maraun (2013) demonstrated that the inflation problem also exists for the deterministic Quantile Matching method, which is also used in daily homogenisation. Current statistical correction methods deterministically change the daily temperature distribution and do not stochastically add noise.

Transferring ideas from downscaling to daily homogenisation is likely fruitful to develop such stochastic variability correction methods. For example, predictor selection methods from downscaling could be useful. Both fields require powerful and robust (time invariant) predictors. Multi-site statistical downscaling techniques aim at reproducing the auto- and cross-correlations between stations (Maraun et al., 2010), which may be interesting for homogenisation as well.

The daily temperature benchmarking study of Rachel Killick (2016) suggests that current daily correction methods are not able to improve the distribution much. There is a pressing need for more research on this topic. However, these methods likely also performed less well because they were used together with detection methods with a much lower hit rate than the comparison methods.

The deterministic correction methods may not lead to severe errors in homogenisation, that should still be studied, but stochastic methods that implement the corrections by adding noise would at least theoretically fit better to the problem. Such stochastic corrections are not trivial and should have the right variability on all temporal and spatial scales.

It should be studied whether it may be better to only detect the dates of break inhomogeneities and perform the analysis on the homogeneous subperiods (removing the need for corrections). The disadvantage of this approach is that most of the trend variance is in the difference in the mean of the HSPs and only a small part is in the trend within the HPSs. In case of trend analysis, this would be similar to the work of the Berkeley Earth Surface Temperature group on the mean temperature signal. Periods with gradual inhomogeneities, e.g., due to urbanisation, would have to be detected and excluded from such an analysis.

An outstanding problem is that current variability correction methods have only been developed for break inhomogeneities, methods for gradual ones are still missing. In homogenisation of the mean of annual and monthly data, gradual inhomogeneities are successfully removed by implementing multiple small breaks in the same direction. However, as daily data is used to study changes in the distribution, this may not be appropriate for daily data as it could produce larger deviations near the breaks. Furthermore, changing the variance in data with a trend can be problematic (Von Storch, 1999).

At the moment most daily correction methods correct the breaks one after another. In monthly homogenisation it is found that correcting all breaks simultaneously (Caussinus and Mestre, 2004) is more accurate (Domonkos et al., 2013). It is thus likely worthwhile to develop multiple breakpoint correction methods for daily data as well.

Finally, current daily correction methods rely on previously detected breaks and assume that the homogeneous subperiods (HSP) are homogeneous (i.e., each segment between breakpoints assume to be homogeneous) . However, these HSP are currently based on detection of breaks in the mean only. Breaks in higher moments may thus still be present in the "homogeneous" sub periods and affect the corrections. If only for this reason, we should also work on detection of breaks in the distribution.

Correction as model selection problem

The number of degrees of freedom (DOF) of the various correction methods varies widely. From just one degree of freedom for annual corrections of the means, to 12 degrees of freedom for monthly correction of the means, to 40 for decile corrections applied to every season, to a large number of DOF for quantile or percentile matching.

A study using PRODIGE on the HOME benchmark suggested that for typical European networks monthly adjustment are best for temperature; annual corrections are probably less accurate because they fail to account for changes in seasonal cycle due to inhomogeneities. For precipitation annual corrections were most accurate; monthly corrections were likely less accurate because the data was too noisy to estimate the 12 correction constants/degrees of freedom.

What is the best correction method depends on the characteristics of the inhomogeneity. For a calibration problem just the annual mean could be sufficient, for a serious exposure problem (e.g., insolation of the instrument) a seasonal cycle in the monthly corrections may be expected and the full distribution of the daily temperatures may need to be adjusted. The best correction method also depends on the reference. Whether the variables of a certain correction model can be reliably estimated depends on how well-correlated the neighbouring reference stations are.

An entire regional network is typically homogenised with the same correction method, while the optimal correction method will depend on the characteristics of each individual break and on the quality of the reference. These will vary from station to station, from break to break and from period to period. Work on correction methods that objectively select the optimal correction method, e.g., using an information criterion, would be valuable.

In case of (sub-)daily data, the options to select from become even larger. Daily data can be corrected just for inhomogeneities in the mean (e.g., Vincent et al., 2002, where daily temperatures are corrected by incorporating a linear interpolation scheme that preserves the previously defined monthly corrections) or also for the variability around the mean. In between are methods that adjust for the distribution including the seasonal cycle, which dominates the variability and is thus effectively similar to mean adjustments with a seasonal cycle. Correction methods of intermediate complexity with more than one, but less than 10 degrees of freedom would fill a gap and allow for more flexibility in selecting the optimal correction model.

When applying these methods (Della-Marta and Wanner, 2006; Wang et al., 2010; Mestre et al., 2011; Trewin, 2013) the number of quantile bins (categories) needs to be selected as well as whether to use physical weather-dependent predictors and the functional form they are used (Auchmann and Brönnimann, 2012). Objective optimal methods for these selections would be valuable.

Related information

WMO Guidelines on Homogenization (English, French, Spanish) 

WMO guidance report: Challenges in the Transition from Conventional to Automatic Meteorological Observing Networks for Long-term Climate Records


Friday, May 1, 2020

Statistical homogenization under-corrects any station network-wide trend biases

Photo of a station of the US Climate Reference Network with a prominent wind shield for the rain gauges.
A station of the US Climate Reference Network.


In the last blog post I made the argument that the statistical detection of breaks in climate station data has problems when the noise is larger than the break signal. The post before argued that the best homogenization correction method we have can remove network-wide trend biases perfectly if all breaks are known. In the light of the last post, we naturally would like to know how well this correction method can remove such biases in the more realistic case when the breaks are imperfectly estimated. That should still be studied much better, but it is interesting to discuss a number of other studies on the removal of network-wide trend biases from the perspective of this new understanding.

So this post will argue that it theoretically makes sense that (unavoidable) inaccuracies of break detection lead to network-wide trend biases only being partially corrected by statistical homogenization.

1) We have seen this in our study of the correction method in response to small errors in the break positions (Lindau and Venema, 2018).

2) The benchmarking study of NOAA’s homogenization algorithm shows that if the breaks are big and easy they are largely removed, while in the scenario where breaks are plentiful and small half of the trend bias remains (Williams et al., 2012).

3) Another benchmarking study show that with the network density of Switzerland homogenization can find and remove clear trend biases, while if you thin this network to be similar to Peru the bias cannot be removed (Gubler et al., 2017).

4) Finally, a benchmarking study of relative humidity station observations in Austria could not remove much of the trend bias, which is likely because relative humidity is not correlated well from station to station (Chimani et al., 2018).

Statistical homogenization on a global scale makes warming estimates larger (Lawrimore et al., 2011; Menne et al., 2018). Thus if it can only remove part of any trend bias, this would mean that quite likely the actual warming was larger.


Figure 1: The inserted versus remaining network-mean trend error. Upper panel for perfect breaks. Lower panel for a small perturbation of the break position. The time series are 100 annual values and have 5 break. Figure 10 in Lindau and Venema (2018).

Joint correction method

First, what did our study on the correction method (Lindau and Venema, 2018) say about the importance of errors in the break position? As the paper was mostly about perfect breaks, we assumed that all breaks were known, but that they had a small error in their position. In the example to the right, we perturbed the break position by a normally distributed random number with standard deviation one (lower panel), while for comparison the breaks are perfect (upper panel).

In both cases we inserted a large network-wide trend bias of 0.873 °C over the length of the century long time series. The inserted errors for 1000 simulations is on the x-axis, the average inserted trend bias is denoted by x̅. The remaining error after homogenization is on the y-axis. Its average is denoted by y̅ and basically zero in case the breaks are perfect (top panel). In case of the small perturbation (lower panel) the average remaining error is 0.093 °C, this is 11 % of the inserted trend bias. That is the under-correction for is a quite small perturbation: 38 % of the positions is not changed at all.

If the standard deviation of the position perturbation is increased to 2, the remaining trend bias is 21 % of the inserted bias.

In the upper panel, there is basically no correlation between the inserted and the remaining error. That is, the remaining error does not depend on the break signal, but only on the noise. In the lower panel with the position errors, there is a correlation between the inserted and remaining trend error. So in this more realistic case, it does matter how large the trend bias due to the inhomogeneities is.

This is naturally an idealized case, position errors will be more complicated in reality and there would be spurious and missing breaks. But this idealized case fitted best to the aim of the paper of studying the correction algorithm in isolation.

It helps understand where the problem lies. The correction algorithm is basically a regression that aims to explain the inserted break signal (and the regional climate signal). Errors in the predictors will lead to an explained variance that is less than 100 %. One should thus expect that the estimated break signal is smaller than the actual break signal. It is thus expected that the trend change due to the estimated break signal produces is smaller than the actual trend change due to the inhomogeneities.

NOAA’s benchmark

That statistical homogenization under-corrects when the going gets tough is also found by the benchmarking study of NOAA’s Pairwise Homogenization Algorithm in Williams et al. (2012). They simulated temperature networks like the American USHCN network and added inhomogeneities according to a range of scenarios. (Also with various climate change signals.) Some scenarios were relatively easy, had few and large breaks, while others were hard and contained many small breaks. The easy cases were corrected nearly perfectly with respect to the network-wide trend, while in the hard cases only half of the inserted network-wide trend error was removed.

The results of this benchmarking for the three scenarios with a network-wide trend bias are shown below. The three panels are for the three scenarios. Each panel has results (the crosses, ignore the box plots) for three periods over which the trend error was computed. The main message is that the homogenized data (orange crosses) lies between the inhomogeneous data (red crosses) and the homogeneous data (green crosses). Put differently, green is how much the climate actually changed, red is how much the estimate is wrong due to inhomogeneities, orange shows that homogenization moves the estimate towards the truth, but never fully gets there.

If we use the number of breaks and their average size as a proxy for the difficulty of the scenario, the one on the left has 6.4 breaks with an average size of 0.8 °C, the one in the middle 8.4 breaks (size 0.4 °C) and the one on the right 10 breaks (size 0.4 °C). So this suggests there is a clear dose effect relationship; although there surely is more than just the number of breaks.


Figures from Williams et al. (2012) showing the results for three scenarios. This is a figure I created from parts of Figure 7 (left), Figure 5 (middle) and Figure 10 (right; their numbers).

When this study appeared in 2012, I found the scenario with the many small breaks much too pessimistic. However, our recent study estimating the properties of the inhomogeneities of the American network found a surprisingly large number of breaks: more than 17 per century; they were bigger: 0.5 °C. So purely based on the number of breaks the hardest scenario is even optimistic, but also size matters.

Not that I would already like to claim that even in a dense network like the American there is a large remaining trend bias and the actual warming was much larger. There is more to the difficulty of inhomogeneities than their number and size. It sure is worth studying.

Alpine benchmarks

The other two examples in the literature I know of are examples of under-correction in the sense of basically no correction because the problem is simply too hard. Gubler et al. (2017) shows that the raw data of the Swiss temperature network has a clear trend bias, which can be corrected with homogenization of its dense network (together with metadata), but when they thin the network to a network density similar to that of Peru, they are unable to correct this trend bias. For more details see my review of this article in the Grassroots Review Journal on Homogenization.

Finally, Chimani et al. (2018) study the homogenization of daily relative humidity observations in Austria. I made a beautiful daily benchmark dataset, it was a lot of fun: on a daily scale you have autocorrelations and a distribution with an upper and lower limit, which need to be respected by the homogeneous data and the inhomogeneous data. But already the normal homogenization of monthly averages was much too hard.

Austria has quite a dense network, but relative humidity is much influenced by very local circumstances and does not correlate well from station to station. My co-authors of the Austrian weather service wanted to write about the improvements: "an improvement of the data by homogenization was non‐ideal for all methods used". For me the interesting finding was: nearly no improvement was possible. That was unexpected. Had we expected that we could have generated a much simpler monthly or annual benchmark to show no real improvement was possible for humidity data and saved us a lot of (fun) work.

What does this mean for global warming estimates?

When statistical homogenization only partially removes large-scale trend biases what does this mean for global warming estimates? In the global temperature datasets statistical homogenization leads to larger warming estimates. So if we tend to underestimate how much correction is needed, this would mean that the Earth most likely warmed up more than current estimates indicate. How much exactly is hard to tell at the moment and thus needs a nuanced discussion. Let me give you my considerations in the next post.


Other posts in this series

Part 5: Statistical homogenization under-corrects any station network-wide trend biases

Part 4: Break detection is deceptive when the noise is larger than the break signal

Part 3: Correcting inhomogeneities when all breaks are perfectly known

Part 2: Trend errors in raw temperature station data due to inhomogeneities

Part 1: Estimating the statistical properties of inhomogeneities without homogenization

References

Chimani Barbara, Victor Venema, Annermarie Lexer, Konrad Andre, Ingeborg Auer and Johanna Nemec, 2018: Inter-comparison of methods to homogenize daily relative humidity. International Journal Climatology, 38, pp. 3106–3122. https://doi.org/10.1002/joc.5488.

Gubler, Stefanie, Stefan Hunziker, Michael Begert, Mischa Croci-Maspoli, Thomas Konzelmann, Stefan Brönnimann, Cornelia Schwierz, Clara Oria and Gabriela Rosas, 2017: The influence of station density on climate data homogenization. International Journal of Climatology, 37, pp. 4670–4683. https://doi.org/10.1002/joc.5114

Lawrimore, Jay H., Matthew J. Menne, Byron E. Gleason, Claude N. Williams, David B. Wuertz, Russell S. Vose and Jared Rennie, 2011: An overview of the Global Historical Climatology Network monthly mean temperature data set, version 3. Journal Geophysical Research, 116, D19121. https://doi.org/10.1029/2011JD016187

Lindau, Ralf and Victor Venema, 2018: On the reduction of trend errors by the ANOVA joint correction scheme used in homogenization of climate station records. International Journal of Climatology, 38, pp. 5255– 5271. Manuscript: https://eartharxiv.org/r57vf/, paywalled article: https://doi.org/10.1002/joc.5728

Menne, Matthew J., Claude N. Williams, Byron E. Gleason, Jared J. Rennie and Jay H. Lawrimore, 2018: The Global Historical Climatology Network Monthly Temperature Dataset, Version 4. Journal of Climate, 31, 9835–9854.
https://doi.org/10.1175/JCLI-D-18-0094.1

Williams, Claude, Matthew Menne and Peter Thorne, 2012: Benchmarking the performance of pairwise homogenization of surface temperatures in the United States. Journal Geophysical Research, 117, D05116. https://doi.org/10.1029/2011JD016761

Thursday, April 23, 2020

Correcting inhomogeneities when all breaks are perfectly known

Much of the scientific literature on the statistical homogenization of climate data is about the detection of breaks, especially the literature before 2012. Much of the more recent literature studies complete homogenization algorithms. That leaves a gap for the study of correction methods.

Spoiler: if we know all the breaks perfectly, the correction method removes trend biases from a climate network perfectly. I found the most surprising outcome that in this case the size of the breaks is irrelevant for how well the correction method works, what matters is the noise.

This post is about a study filling this gap by Ralf Lindau and me. The post assumes you are familiar with statistical homogenization, if not you can find more information here. For correction you naturally need information on the breaks. To study correction in isolation as much as possible, we have assumed that all breaks are known. That is naturally quite theoretical, but it makes it possible to study the correction method in detail.

The correction method we have studied is a so-called joint correction method, that means that the corrections for all stations in a network are computed in one go. The somewhat unfortunate name ANOVA is typically used for this correction method. The equations are the same as those of the ANOVA test, but the application is quite different, so I find this name confusing.

This correction method makes three assumptions. 1) That all stations have the same regional climate signal. 2) That every station has its own break signal, which is a step function with the positions of the steps given by the known breaks. 3) That every station also has its own measurement and weather noise. The algorithm computes the values of the regional climate signal and the levels of the step functions by minimizing this noise. So in principle the method is a simple least square regression, but with much more coefficients than when you use it to compute a linear trend.

Three steps

In this study we compute the errors after correction in three ways, one after another. To illustrate this let’s start simple and simulate 1000 networks of 10 stations with 100 years/values. In the first examples below these stations have exactly five breaks, whose sizes are drawn from a normal distribution with variance one. The noise, simulating measurement uncertainties and differences in local weather, is also noise with a variance of one. This is quite noisy for European temperature annual averages, but happens earlier in the climate record and in other regions. Also to keep it simple there is no net trend bias yet.

The figure to the right is a scatterplot with theoretically 1000*10*100=1 million yearly temperature averages as they were simulated (on the x-axis) and after correction (y-axis).

Within the plots we show some statistics, on the top left these are 1) the mean of x, i.e. the mean of the inserted inhomogeneities. 2) Then the variance of the inserted inhomogeneities x. 3) Then the mean of the computed corrections y. 4) Finally the variance of the corrections.

In the lower right, 1) the correlation (r) is shown and 2) the number of values (n). For technical reasons, we only show a sample of the 1 million points in the scatterplot, but these statistics are based on all values.

The results look encouraging, they show a high correlation: 0.96. And the points nicely scatter around the x=y line.

The second step is to look at the trends of the stations, there is one trend per station, so we have 1000*10=10,000 of them. See figure to the right. The trend is computed in the standard way using least squares linear regression. Trends would normally have the unit °C per year or century. Here we multiplied the trend with the period, so the values are the total change due to the trend and have unit °C.

The values again scatter beautifully around x=y and the correlation is as high as before: 0.95.

The final step is to compute the 1000 network trends. The result is shown below. The averaging over 10 stations reduces the noise in the scatterplot and the values beautifully scatter around the x=y line, while the correlation is now smaller, it is still decent: 0.81. Remember we started with quite noisy data where the noise was as large as the break signal.



The remaining error

In the next step, rather than plotting the network trend after correction on the y-axis, we plotted the difference between this trend and the inserted network mean trend, which is the trend error remaining after correction. This is plotted in the left panel below. For this case the uncertainty after correction is half of the uncertainty before correction in terms of the printed variances. It is typical to express uncertainties as standard deviations, then the remaining trend error is 71%. Furthermore, their averages are basically zero, so no bias is introduced.

With a signal having as much break variance as noise variance from the measurement and weather differences between the stations, the correction algorithm naturally cannot reconstruct the original inhomogeneities perfectly, but it does so decently and its errors have nice statistical properties.

Now if we increase the variance of the break signal by a factor two we get the result shown in the right panel. Comparing the two panels, it is striking that the trend error after correction is the same, it does not depend on the break signal, only the noise determines how accurate the trends are. In case of large break signals this is nice, but if the break signal is small, this will also mean that the the correction can increase the random trend error. That could be the case in regions where the networks are sparse and the difference time series between two neighboring stations consequently quite noisy.



Large-scale trend biases

This was all quite theoretical, as the networks did not have a bias in their trends. They did have a random trend error due to the inserted inhomogeneities, but averaging over many such networks of 10 stations the trend error would tend to zero. If that were the case in reality, not many people would work on statistical homogenization. The main aim is the reduce the uncertainties in large-scale trends due to (possible) large-scale biases in the trends.

Such large-scale trend biases can be caused by changes in the thermometer screens used, the transition from manual to automatic observations, urbanization around the stations or relocations of stations to better sites.

If we add a trend bias to the inserted inhomogeneities and correct the data with the joint correction method, we find the result to the right. We inserted a large trend bias to all networks of 0.9 °C and after correction it was completely removed. This again does not depend on the size of the bias or the variance of the break signal.

However, this all is only true if all breaks are known, before I write a post about the more realistic case were the breaks are perfectly known, I will first have to write a post about how well we can detect breaks. That will be the next homogenization post.

Some equations

Next to these beautiful scatterplots, the article has equations for each of the the above mentioned three steps 1) from the inserted breaks and noise to what this means for the station data, 2) how this affects the station trend errors, and 3) how this results in network trends.

With equations for the influence of the size of the break signal (the standard deviation of the breaks) and the noise of the difference time series (the standard deviation of the noise) one can then compute how the trend errors before and after correction depend on the signal to noise ratio (SNR), which is the standard deviation of the breaks divided by the standard deviation of the noise. There is also a clear dependence on the number of breaks.

Whether the network trends increase or decrease due to the correction method is determined by the quite simple equation: 6 times the SNR divided by the number of breaks. So if the SNR is one, as in the initial example of this post and the number of breaks is 6 or smaller the correction would improve the trend error, while if there are more than 7 breaks the correction would add a random trend error. This simple equation ignores a weak dependence of the results on the number of stations in the networks.

Further research

I started saying that the correction methods was a research gap, but homogenization algorithms have many more steps beyond detection and correction, which should also be studied in isolation if possible to gain a better understanding.

For example, the computation of a composite reference. The selection of reference stations. The combination of statistical homogenization with metadata on documented changes in the measurement setup. And so on. The last chapter of the draft guidance on homogenization describes research needs, including research on homogenization methods. There are still of lot of interesting and important questions.


Other posts in this series

Part 5: Statistical homogenization under-corrects any station network-wide trend biases

Part 4: Break detection is deceptive when the noise is larger than the break signal

Part 3: Correcting inhomogeneities when all breaks are perfectly known

Part 2: Trend errors in raw temperature station data due to inhomogeneities

Part 1: Estimating the statistical properties of inhomogeneities without homogenization

References

Lindau, R, V. Venema, 2018: On the reduction of trend errors by the ANOVA joint correction scheme used in homogenization of climate station records. International Journal of Climatology, 38, pp. 5255– 5271. Manuscript: https://eartharxiv.org/r57vf/ Article: https://doi.org/10.1002/joc.5728