Wednesday, 10 July 2013

Statistical problems: The multiple breakpoint problem in homogenization and remaining uncertainties

This is part two of a series on statistically interesting problems in the homogenization of climate data. The first part was about the inhomogeneous reference problem in relative homogenization. This part will be about two problems: the multiple breakpoint problem and about computing the remaining uncertainties in homogenized data.

I hope that this series can convince statisticians to become (more) active in homogenization of climate data, which provides many interesting problems.

The five main statistical problems are:
Problem 1. The inhomogeneous reference problem
Neighboring stations are typically used as reference. Homogenization methods should take into account that this reference is also inhomogeneous.
Problem 2. The multiple breakpoint problem
A longer climate series will typically contain more than one break. Methods designed to take this into account are more accurate than ad-hoc solutions based on single-breakpoint methods.
Problem 3. Computing uncertainties
We have a general idea of the remaining uncertainties of homogenized data, but need methods to estimate the uncertainties for a specific dataset or station.
Problem 4. Correction as model selection problem
We need objective selection methods for the best correction model to be used.
Problem 5. Deterministic or stochastic corrections?
Current correction methods are deterministic. A stochastic approach would be more elegant.

Problem 2. The multiple breakpoint problem

For temperature time series, about one break per 15 to 20 years is typical. Thus most interesting stations will contain more than one break. Unfortunately, most statistical detection methods have been developed for a single break. To apply them to series with multiple breaks, one ad-hoc solution is to first split the series at the most significant break (as, for example, the Standard Normal Homogeneity Test, SNHT, does) and then investigate the two subseries. Such a greedy algorithm does not always find the optimal solution.
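To make the greedy strategy concrete, here is a minimal Python sketch of such a binary segmentation on a difference series, using the SNHT test statistic. The critical value t_crit, the minimum segment length min_len and all function names are illustrative assumptions, not settings of any operational method.

```python
import numpy as np

def snht_statistic(z):
    """SNHT statistic T(k) for a standardized difference series z."""
    n = len(z)
    ks = np.arange(1, n)                       # candidate break positions
    csum = np.cumsum(z)
    mean1 = csum[:-1] / ks                     # mean of z[:k]
    mean2 = (csum[-1] - csum[:-1]) / (n - ks)  # mean of z[k:]
    return ks * mean1 ** 2 + (n - ks) * mean2 ** 2, ks

def greedy_breaks(diff, t_crit=9.0, min_len=10):
    """Greedy break detection: split at the largest break, then recurse.

    t_crit and min_len are illustrative choices, not recommended settings.
    """
    breaks = []

    def recurse(lo, hi):
        seg = np.asarray(diff[lo:hi], dtype=float)
        if len(seg) < 2 * min_len:
            return
        z = (seg - seg.mean()) / seg.std(ddof=1)
        t, ks = snht_statistic(z)
        # Exclude break positions that would create a too-short segment.
        t = np.where((ks >= min_len) & (ks <= len(seg) - min_len), t, -np.inf)
        k = ks[np.argmax(t)]
        if t.max() > t_crit:                   # significant break found
            breaks.append(lo + k)              # split here and recurse
            recurse(lo, lo + k)
            recurse(lo + k, hi)

    recurse(0, len(diff))
    return sorted(breaks)
```

Because each split is made without knowledge of the breaks found later, the first split point can land away from any true break when several breaks interact, which is why the greedy solution is not guaranteed to be optimal.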

Another solution is to detect breaks on short windows. The window should be short enough to contain only one break, but such short windows reduce the power of detection considerably.

Multiple breakpoint methods can find an optimal solution and are nowadays numerically feasible, especially when using the optimization method known as dynamic programming. For a given number of breaks, these methods find the break combination that minimizes the internal variance, that is, the variance of the homogeneous subperiods (equivalently, the break combination that maximizes the variance explained by the breaks). To find the optimal number of breaks, a penalty is added that increases with the number of breaks. Examples of such methods are PRODIGE (Caussinus & Mestre, 2004) and ACMANT (based on PRODIGE; Domonkos, 2011). In a similar line of research, Lu et al. (2009) solved the multiple breakpoint problem using a minimum description length (MDL) based information criterion as the penalty function.
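To sketch how dynamic programming makes the optimal solution feasible, the following Python code implements optimal partitioning of a difference series: it minimizes the internal variance of the subperiods plus a penalty per break. The constant penalty is a simplification for illustration only; PRODIGE uses the Caussinus-Lyazrhi criterion, and the MDL approach of Lu et al. (2009) uses a different penalty again.

```python
import numpy as np

def segment_costs(diff):
    """Within-segment sum of squared deviations for every segment [i, j)."""
    n = len(diff)
    s = np.concatenate(([0.0], np.cumsum(diff)))
    s2 = np.concatenate(([0.0], np.cumsum(np.square(diff))))
    cost = np.full((n + 1, n + 1), np.inf)
    for i in range(n):
        for j in range(i + 1, n + 1):
            mean = (s[j] - s[i]) / (j - i)
            cost[i, j] = (s2[j] - s2[i]) - (j - i) * mean ** 2
    return cost

def optimal_breaks(diff, penalty):
    """Optimal partitioning by dynamic programming.

    Minimizes the summed within-segment variance plus a constant penalty
    per break; 'penalty' stands in for criteria such as Caussinus-Lyazrhi.
    """
    diff = np.asarray(diff, dtype=float)
    n = len(diff)
    cost = segment_costs(diff)
    best = np.full(n + 1, np.inf)    # best[j]: optimal score of diff[:j]
    best[0] = -penalty               # offsets the penalty of the first segment
    last = np.zeros(n + 1, dtype=int)
    for j in range(1, n + 1):
        scores = best[:j] + cost[:j, j] + penalty
        last[j] = int(np.argmin(scores))
        best[j] = scores[last[j]]
    breaks, j = [], n                # backtrack the optimal segmentation
    while last[j] > 0:
        breaks.append(int(last[j]))
        j = last[j]
    return sorted(breaks)
```

The precomputed cost table makes the search over all break combinations quadratic in the series length, which is easily feasible for century-long annual or monthly station series.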


This figure shows a screenshot of PRODIGE used to homogenize Salzburg with its neighbors. The neighbors are sorted by their cross-correlation with Salzburg. The top panel shows the difference time series of Salzburg with Kremsmünster, which has a standard deviation of 0.14°C. The middle panel shows the difference between Salzburg and München (0.18°C), and the lower panel the difference between Salzburg and Innsbruck (0.29°C). Not having any experience with PRODIGE, I would read this graph as suggesting that Salzburg probably has breaks in 1902, 1938 and 1995. This fits the station history: in 1903 the station was moved to another school, in 1939 it was relocated to the airport, and in 1996 it was moved within the grounds of the airport. The other breaks are not consistently seen in multiple pairs and may thus well be in another station.

Recently, this penalty function was found to be suboptimal (Lindau & Venema, 2013a): the penalty should be a function of the total number of breaks, not a fixed amount per break, and the relation with the length of the series should be reversed. A better penalty function is thus needed. See this post for more information on the multiple breakpoint problem and Lindau and Venema (2013a) for the details.

Multiple breakpoint methods are much more accurate than single breakpoint methods combined with ad-hoc fixes. This expected higher accuracy has a theoretical foundation (Hawkins, 1972). In addition, a recent benchmarking study (a numerical validation study using realistic datasets) of the European project HOME found that modern homogenization methods, which take the multiple breakpoint and the inhomogeneous reference problems into account, are about a factor of two more accurate than traditional methods (Venema et al., 2012).

Problem 3. Computing uncertainties

Even after homogenization, uncertainties remain in the data due to various problems:
  1. Not all breaks in the candidate station can be detected
  2. Uncertainty in the estimation of correction parameters due to insufficient data
  3. Uncertainties in the corrections due to remaining inhomogeneities in the references
  4. The date of the break may be imprecise (see Lindau & Venema, 2013b)
From validation and benchmarking studies we have a reasonable idea about the remaining uncertainties that one can expect in homogenized data, at least with respect to the mean. For daily data, individual developers have validated their methods, but systematic validation and comparison studies are still missing.

Furthermore, such studies only provide a general uncertainty level, whereas more detailed information for every single station and period would be valuable. The uncertainties strongly depend on the inhomogeneity of the raw data and on the quality and cross-correlations of the reference stations, both of which vary strongly per station, region and period.

Communicating such a complicated error structure, which is mainly temporal but also partially spatial, is a problem in itself. Maybe generating an ensemble of possible realizations, similar to Brohan et al. (2006), could provide a workable route. Furthermore, not only the uncertainty in the means should be considered; especially for daily data, uncertainties in the complete probability density function need to be estimated and communicated.
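As a rough illustration of the ensemble idea, the following Python sketch generates possible realizations of a homogenized series by perturbing the break dates and the correction magnitudes. The independent normal error model and all parameter names are assumptions invented for this example, not a validated uncertainty model.

```python
import numpy as np

def homogenization_ensemble(raw, breaks, corrections, date_sd, corr_sd,
                            n_members=100, seed=42):
    """Ensemble of possible homogenized series (toy error model).

    Each member perturbs every break date (normal jitter in time steps)
    and correction magnitude (normal, in data units) independently.
    """
    rng = np.random.default_rng(seed)
    raw = np.asarray(raw, dtype=float)
    n = len(raw)
    members = np.empty((n_members, n))
    for m in range(n_members):
        series = raw.copy()
        for b, corr, dsd, csd in zip(breaks, corrections, date_sd, corr_sd):
            b_pert = int(np.clip(round(b + rng.normal(0.0, dsd)), 1, n - 1))
            corr_pert = corr + rng.normal(0.0, csd)
            series[:b_pert] += corr_pert   # adjust the pre-break period
        members[m] = series
    return members
```

The spread of the ensemble members at each time step then provides a station- and period-specific uncertainty estimate that can be propagated directly into trend calculations.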

Related posts

All posts in this series:
Problem 1. The inhomogeneous reference problem
Neighboring stations are typically used as reference. Homogenization methods should take into account that this reference is also inhomogeneous.
Problem 2. The multiple breakpoint problem
A longer climate series will typically contain more than one break. Methods designed to take this into account are more accurate than ad-hoc solutions based on single-breakpoint methods.
Problem 3. Computing uncertainties
We have a general idea of the remaining uncertainties of homogenized data, but need methods to estimate the uncertainties for a specific dataset or station.
Problem 4. Correction as model selection problem
We need objective selection methods for the best correction model to be used.
Problem 5. Deterministic or stochastic corrections?
Current correction methods are deterministic. A stochastic approach would be more elegant.
Previously, I wrote a longer explanation of the multiple breakpoint problem.

In previous posts I have discussed future research in homogenization from a climatological perspective.

Future research in homogenisation of climate data – EMS 2012 in Poland

HUME: Homogenisation, Uncertainty Measures and Extreme weather

A database with daily climate data for more reliable studies of changes in extreme weather

References

Brohan, P., J. Kennedy, I. Harris, S.F.B. Tett and P.D. Jones. Uncertainty estimates in regional and global observed temperature changes: a new dataset from 1850. Journal of Geophysical Research, 111, no. D12106, 2006.

Caussinus, H. and O. Mestre. Detection and correction of artificial shifts in climate series. Applied Statistics, 53, pp. 405–425, doi: 10.1111/j.1467-9876.2004.05155.x, 2004.

Domonkos, P. Adapted Caussinus-Mestre Algorithm for Networks of Temperature series (ACMANT). International Journal of Geosciences, 2, 293-309, doi: 10.4236/ijg.2011.23032, 2011.

Hawkins, D.M. On the choice of segments in piecewise approximation. Journal of the Institute of Mathematics and its Applications, 9, pp. 250–256, 1972.

Lindau, R. and V.K.C. Venema. On the multiple breakpoint problem and the number of significant breaks in homogenisation of climate records. Idojaras, Quarterly journal of the Hungarian Meteorological Service, 117, no. 1, pp. 1-34, 2013a.

Lindau, R. and V.K.C. Venema. Break position errors in climate records. 12th International Meeting on Statistical Climatology, IMSC2013, Jeju, South Korea, 24-28 June, 2013b.

Lu, Q., R.B. Lund, and T.C.M. Lee. An MDL approach to the climate segmentation problem. Annals of Applied Statistics, 4, no. 1, pp. 299-319, doi: 10.1214/09-AOAS289, 2009.

Venema, V., O. Mestre, E. Aguilar, I. Auer, J.A. Guijarro, P. Domonkos, G. Vertacnik, T. Szentimrey, P. Stepanek, P. Zahradnicek, J. Viarre, G. Müller-Westermeier, M. Lakatos, C.N. Williams, M.J. Menne, R. Lindau, D. Rasol, E. Rustemeier, K. Kolokythas, T. Marinova, L. Andresen, F. Acquaotta, S. Fratianni, S. Cheval, M. Klancar, M. Brunetti, Ch. Gruber, M. Prohom Duran, T. Likso, P. Esteban, Th. Brandsma. Benchmarking homogenization algorithms for monthly data. Climate of the Past, 8, pp. 89-115, doi: 10.5194/cp-8-89-2012, 2012.
