Break detection is deceptive when the noise is larger than the break signal

I am disappointed in science. It is hard to believe that it took this long for us to discover that break detection has serious problems when the signal to noise ratio is low. However, as far as we can judge this was new science and it certainly was not common knowledge, which it should have been because it has large consequences.

This post describes a paper by Ralf Lindau and me about how break detection depends on the signal to noise ratio (Lindau and Venema, 2018). The signal in this case is the breaks we would like to detect. These breaks could be from a change in instrument or location of the station. We detect breaks by comparing a candidate station to a reference. This reference can be one other neighbouring station or an average of neighbouring stations. The candidate and reference should be sufficiently close so that they have the same regional climate signal, which is then removed by subtracting the reference from the candidate. The difference time series that is left contains breaks and noise because of measurement uncertainties and differences in local weather. The noise thus depends on the quality of the measurements, on the density of the measurement network and on how variable the weather is spatially.
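To make the difference series concrete, here is a minimal sketch with synthetic data. All numbers (trend, noise levels, break size and position) are made up for illustration and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_years = 100

# Regional climate signal shared by both stations (hypothetical: a small
# trend plus year-to-year variability).
regional = 0.01 * np.arange(n_years) + rng.normal(0.0, 0.5, n_years)

# Local weather and measurement noise, independent at each station.
candidate = regional + rng.normal(0.0, 0.2, n_years)
reference = regional + rng.normal(0.0, 0.2, n_years)

# Insert one break in the candidate: a jump of 0.5 from year 60 onwards.
candidate[60:] += 0.5

# Subtracting the reference removes the shared regional signal; what is
# left is the break plus the noise of both stations.
difference = candidate - reference

print("std of candidate :", candidate.std())
print("std of difference:", difference.std())
print("jump in difference:", difference[60:].mean() - difference[:60].mean())
```

The difference series varies much less than the candidate itself, which is why the break is easier to see there.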

The signal to noise ratio (SNR) is simply defined as the standard deviation of the time series containing only the breaks divided by the standard deviation of the time series containing only the noise. For short I will denote these as the break signal and the noise signal, which have a break variance and a noise variance. When generating data to test homogenization algorithms, you know exactly how strong the break signal and the noise signal are. In case of real data, you can estimate it, for example with the methods I described in a previous blog post. In that study, we found a signal to noise ratio for annual temperature averages observed in Germany of 3 to 4 and in America of about 5.
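For generated data this definition is a one-liner. The sketch below assumes you have the break signal and the noise signal as separate arrays; the function name and the example step function are mine, chosen only for illustration.

```python
import numpy as np

def signal_to_noise_ratio(break_signal, noise_signal):
    """Standard deviation of the break signal divided by that of the noise."""
    return np.std(break_signal) / np.std(noise_signal)

rng = np.random.default_rng(1)
n = 100

# Hypothetical break signal: a step function with breaks at years 30 and 70.
break_signal = np.zeros(n)
break_signal[30:70] = 1.0
break_signal[70:] = -0.5

# Hypothetical noise signal: white noise with unit standard deviation.
noise_signal = rng.normal(0.0, 1.0, n)

snr = signal_to_noise_ratio(break_signal, noise_signal)
print("SNR:", snr)
```

With real data the two signals are not available separately, so the SNR has to be estimated, which is a much harder problem.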

Temperature is studied a lot and much of the work on homogenization takes place in Europe and America. Here this signal to noise ratio is high enough. That may be one reason why climatologists did not find this problem sooner. Many other sciences use similar methods, and we are all supported by a considerable statistical literature. I have no idea what their excuses are.

Why a low SNR is a problem

As scientific papers go, the discussion is quite mathematical, but the basic problem is relatively easy to explain in words. In statistical homogenization we do not know in advance where the break or breaks will be. So we basically try many break positions and search for the break positions that result in the largest breaks (or, for the algorithm we studied, that explain the most variance).

If you do this for a time series that contains only noise, this will also produce (small) breaks. For example, in case you are looking for one break, due to pure chance there will be a difference between the averages of the first and the last segment. This difference is larger than it would be for a predetermined break position, as we try all possible break positions and then select the one with the largest difference. To determine whether the breaks we found are real, we require that they are so large that it is unlikely that they are due to chance, while there are actually no breaks in the series. So we study how large breaks are in series that only contain noise to determine how large such random breaks are. Statisticians would talk about the breaks being statistically significant with white noise as the null hypothesis.
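A small Monte Carlo experiment illustrates this for the single-break case. This is only a sketch: the minimum segment length `min_len`, the number of trials and the 95% level are my assumptions, and it uses the plain difference in segment averages rather than the explained variance of the algorithm we studied.

```python
import numpy as np

def largest_single_break(series, min_len=5):
    """Try all break positions and return the one with the largest
    absolute difference between the two segment averages."""
    n = len(series)
    best_pos, best_diff = None, 0.0
    for pos in range(min_len, n - min_len + 1):
        diff = abs(series[pos:].mean() - series[:pos].mean())
        if diff > best_diff:
            best_pos, best_diff = pos, diff
    return best_pos, best_diff

rng = np.random.default_rng(2)
n, n_trials = 100, 300
fixed, searched = [], []
for _ in range(n_trials):
    noise = rng.normal(0.0, 1.0, n)  # pure white noise, no breaks at all
    # Break size at a predetermined position (the middle of the series).
    fixed.append(abs(noise[50:].mean() - noise[:50].mean()))
    # Break size after searching for the "best" position.
    searched.append(largest_single_break(noise)[1])

fixed_mean = np.mean(fixed)
searched_mean = np.mean(searched)
# A break in real data would have to exceed this to be called significant.
threshold = np.quantile(searched, 0.95)

print("mean break size, fixed position   :", fixed_mean)
print("mean break size, searched position:", searched_mean)
print("95% threshold from noise-only runs:", threshold)
```

The searched break is never smaller than the fixed one, because the middle of the series is one of the positions that is tried.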

When the breaks are really large compared to the noise, one can see by eye where the breaks are, and this method is a nice way to do this automatically for many stations. When the breaks are “just” large, it is a great method to objectively determine the number of breaks and the optimal break positions.

The problem comes when the noise is larger than the break signal. Not that it is fundamentally impossible to detect such breaks. If you have a 100-year time series with a break in the middle, you would be averaging over 50 noise values on either side and the difference in their averages would be much smaller than the noise itself. Even if noise and signal are about the same size the noise effect is thus expected to be smaller than the size of such a break. To put it in another way, the noise is not correlated in time, while the break signal is the same for many years; that fundamental difference is what the break detection exploits.
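The arithmetic behind the 100-year example can be checked quickly. For white noise with standard deviation sigma, the difference between two averages over 50 values each has a standard deviation of sigma times the square root of (1/50 + 1/50), which is 0.2 sigma. The simulation below is just a sanity check of that formula, with sigma set to one.

```python
import numpy as np

sigma = 1.0   # noise standard deviation
n_half = 50   # years on either side of the break

# Analytical value for the noise in the difference of the two averages.
expected_std = sigma * np.sqrt(1.0 / n_half + 1.0 / n_half)

# Check with many noise-only series of 100 years.
rng = np.random.default_rng(3)
noise = rng.normal(0.0, sigma, size=(10_000, 2 * n_half))
diffs = noise[:, n_half:].mean(axis=1) - noise[:, :n_half].mean(axis=1)
simulated_std = diffs.std()

print("expected :", expected_std)
print("simulated:", simulated_std)
```

So for a known break position in the middle of the series, a break as large as the noise would stand out by about a factor of five.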

However, to come to the fundamental problem, it becomes hard to determine the positions of the breaks. Imagine the theoretical case where the break positions are fully determined by the noise, not by the breaks. From the perspective of the break signal, these break positions are random. The problem is that random breaks also explain a part of the break signal. So one would have a combination with a maximum contribution of the noise plus a part of the break signal. Because of this additional contribution by the break signal, this combination may have larger breaks than expected in a pure noise signal. In other words, the result can be statistically significant, while we have no idea where the positions of the breaks are.
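That random breaks explain part of the break signal is easy to try out. The sketch below generates a noise-free break signal, fits a step function with the same number of breaks at random positions and computes the fraction of the break variance this explains. The series length, the number of breaks and the normally distributed break levels are assumptions for illustration.

```python
import numpy as np

def step_function(series, break_positions):
    """Replace each segment between the breaks by its average."""
    fitted = np.empty(len(series))
    edges = [0, *sorted(break_positions), len(series)]
    for start, end in zip(edges[:-1], edges[1:]):
        fitted[start:end] = series[start:end].mean()
    return fitted

def random_breaks(rng, n, k):
    """Draw k distinct break positions between 1 and n - 1."""
    return sorted(rng.choice(np.arange(1, n), size=k, replace=False))

rng = np.random.default_rng(4)
n, k, n_trials = 100, 5, 1000
fractions = []
for _ in range(n_trials):
    # True break signal: k breaks with random levels and no noise.
    true_positions = random_breaks(rng, n, k)
    levels = rng.normal(0.0, 1.0, k + 1)
    break_signal = np.repeat(levels, np.diff([0, *true_positions, n]))
    # Step function with k breaks at unrelated, random positions.
    guess = step_function(break_signal, random_breaks(rng, n, k))
    fractions.append(1.0 - np.var(break_signal - guess) / np.var(break_signal))

mean_fraction = np.mean(fractions)
print("average fraction of break variance explained:", mean_fraction)
```

The explained fraction is clearly larger than zero even though the guessed positions have nothing to do with the true ones.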

In a real case the breaks look even more statistically significant because the positions of the breaks are determined by both the noise and the break signal.

That is the fundamental problem: the test for the homogeneity of the series rightly detects that the series contains inhomogeneities, but if the signal to noise ratio is low we should not jump to conclusions and expect that the set of break positions that gives us the largest breaks has much to do with the break positions in the data. Only if the signal to noise ratio is high is this relationship close enough.

Some numbers

This is a general problem, which I expect all statistical homogenization algorithms to have, but to put some numbers on this, we need to specify an algorithm. We have chosen to study the multiple breakpoint method that is implemented in PRODIGE (Caussinus and Mestre, 2004), HOMER (Mestre et al., 2013) and ACMANT (Domonkos and Coll, 2017). These are among the best, if not the best, methods we currently have. We applied it by comparing pairs of stations, like PRODIGE and HOMER do.

For a certain number of breaks this method effectively computes the combination of breaks that has the highest break variance. If you add more breaks, you will increase the break variance those breaks explain, even if it were purely due to noise, so there is additionally a penalty function that depends on the number of breaks. The algorithm selects that option where the break variance minus such a penalty is highest. A statistician would call this a model selection problem and the job of the penalty is to keep the statistical model (the step function explaining the breaks) reasonably simple.
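The structure of such a method can be sketched as follows. This is a generic illustration, not the code of PRODIGE, HOMER or ACMANT: the dynamic programming search is a standard way to find the best break positions for each number of breaks, and the linear `penalty_per_break` of 0.1 is a placeholder I made up, not the penalty function these methods use.

```python
import numpy as np

def best_segmentations(series, max_breaks):
    """For k = 0..max_breaks, return (residual sum of squares, break positions)
    of the step function with k breaks that fits `series` best."""
    n = len(series)
    csum = np.concatenate([[0.0], np.cumsum(series)])
    csum2 = np.concatenate([[0.0], np.cumsum(series ** 2)])

    def sse(i, j):
        # Sum of squared deviations of series[i:j] from its own mean.
        s = csum[j] - csum[i]
        return csum2[j] - csum2[i] - s * s / (j - i)

    # cost[k, j]: smallest residual for series[:j] split by k breaks.
    cost = np.full((max_breaks + 1, n + 1), np.inf)
    back = np.zeros((max_breaks + 1, n + 1), dtype=int)
    for j in range(1, n + 1):
        cost[0, j] = sse(0, j)
    for k in range(1, max_breaks + 1):
        for j in range(k + 1, n + 1):
            for i in range(k, j):  # i is the position of the last break
                c = cost[k - 1, i] + sse(i, j)
                if c < cost[k, j]:
                    cost[k, j], back[k, j] = c, i

    results = []
    for k in range(max_breaks + 1):
        positions, j = [], n
        for level in range(k, 0, -1):  # walk back through the stored breaks
            j = int(back[level, j])
            positions.append(j)
        results.append((cost[k, n], sorted(positions)))
    return results

def select_breaks(series, max_breaks=5, penalty_per_break=0.1):
    """Pick the number of breaks where explained variance minus penalty is highest."""
    total = np.sum((series - series.mean()) ** 2)
    best_score, best_positions = -np.inf, []
    for k, (residual, positions) in enumerate(best_segmentations(series, max_breaks)):
        score = (1.0 - residual / total) - penalty_per_break * k
        if score > best_score:
            best_score, best_positions = score, positions
    return best_positions

# Example with a very high SNR: breaks at 30 and 70, little noise.
rng = np.random.default_rng(6)
series = rng.normal(0.0, 0.3, 100)
series[30:70] += 3.0
series[70:] -= 2.0
detected = select_breaks(series)
print("detected break positions:", detected)
```

Without the penalty the score could only go up with every extra break, so the search would always end at `max_breaks`.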

In the end, if the signal to noise ratio is one half, the break positions that explain the most variance are just as “good” at explaining the actual break signal in the data as breaks at random positions.

With this detection model, we derived the plot below. Let me talk you through it. On the x-axis is the SNR; on the far right the break signal is twice as strong as the noise signal. On the y-axis is how well the step function belonging to the detected breaks fits the step function of the breaks we actually inserted. The lower curve, with the plus symbols, is the detection algorithm as I described above. You can see that for a high SNR it finds a solution that closely matches what we put in and the difference is almost zero. The upper curve, with the ellipse symbols, is for the solution you find if you put in random breaks. You can see that for a high SNR the random breaks have a difference of 0.5. As the variance of the break signal is one, this means that half the variance of the break signal is explained by random breaks.

Figure 13b from Lindau and Venema (2018).

When the SNR is about 0.5, the random breaks are about as good as the breaks proposed by the algorithm described above.
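If you want to play with this comparison yourself, here is a much simplified toy version with a single break and a single detected break position. It is not the experiment behind Figure 13b, which uses multiple breaks and a penalty function, so do not expect the same numbers. All settings (series length, number of trials, range of true break positions) are assumptions.

```python
import numpy as np

def fit_one_break(series, pos):
    """Step function with one break at `pos`, using the segment averages."""
    fitted = np.empty(len(series))
    fitted[:pos] = series[:pos].mean()
    fitted[pos:] = series[pos:].mean()
    return fitted

def detect_one_break(series):
    """Break position that leaves the smallest residual variance."""
    n = len(series)
    sse = [np.sum((series - fit_one_break(series, p)) ** 2) for p in range(1, n)]
    return 1 + int(np.argmin(sse))

def mean_errors(snr, n=100, n_trials=200, seed=5):
    """Mean squared difference to the true break signal for the detected
    break and for a break at a random position."""
    rng = np.random.default_rng(seed)
    detected_err, random_err = [], []
    for _ in range(n_trials):
        true_pos = rng.integers(10, n - 10)
        break_signal = np.where(np.arange(n) < true_pos, -1.0, 1.0)
        # Normalise the break signal to zero mean and unit variance.
        break_signal = (break_signal - break_signal.mean()) / break_signal.std()
        series = break_signal + rng.normal(0.0, 1.0 / snr, n)
        candidates = ((detected_err, detect_one_break(series)),
                      (random_err, rng.integers(1, n)))
        for errors, pos in candidates:
            fitted = fit_one_break(series, pos)
            errors.append(np.mean((fitted - break_signal) ** 2))
    return np.mean(detected_err), np.mean(random_err)

for snr in (0.25, 0.5, 1.0, 2.0):
    detected, random_ = mean_errors(snr)
    print(f"SNR {snr:4.2f}: detected {detected:.3f}, random {random_:.3f}")
```

In both cases the segment averages are computed from the noisy series, so at low SNR the noise also contributes to the error of the random breaks.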

One may be tempted to think that if the data is too noisy, the detection algorithm should detect fewer breaks, that is, the penalty function should be bigger. However, the problem is not the detection of whether there are breaks in the data, but where the breaks are. A larger penalty thus does not solve the problem and even makes the results slightly worse. This is not in the paper, but later I wondered whether setting more breaks is such a bad thing, so we also tried lowering the threshold. This again made the results worse.

So what?

The next question is naturally: is this bad? One reason to investigate correction methods in more detail, as described in my last blog post, was the hope that maybe accurate break positions are not that important. It could have been that the correction method still produces good results even with random break positions. This is unfortunately not the case. Even quite small errors in the break positions deteriorate the outcome considerably. This will be the topic of the next post.

Not homogenizing the data is also not a solution. As I described in a previous blog post, the breaks in Germany are small and infrequent, but they still have a considerable influence on the trends of stations. The figure below shows the trend differences between many pairs of nearby stations in Germany. Their differences in trends will be mostly due to inhomogeneities. The standard deviation of 0.628 °C per century for the pairs translates to an average error in the trends of individual stations of 0.4 °C per century.
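One way to get from the pair value to the station value is to assume that the trend errors of the two stations in a pair are independent and equally large. The variance of the difference is then twice the variance of a single station, so you divide the standard deviation by the square root of two. That this is the route from 0.628 to 0.4 is my assumption here; the arithmetic itself is simple.

```python
import math

pair_std = 0.628  # standard deviation of trend differences, °C per century

# Assumption: independent, equally large trend errors at both stations,
# so var(difference) = 2 * var(single station).
station_std = pair_std / math.sqrt(2)

print(round(station_std, 2))  # 0.44
```

Rounded to one digit this is the 0.4 °C per century mentioned above.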

The trend differences (y-axis) of pairs of stations (x-axis) in the German temperature network. The trends were computed from 316 nearby pairs over 1950 to 2000. Figure 2 from Lindau and Venema (2018).

This finding makes it more important to work on methods to estimate the signal to noise ratio of a dataset before we try to homogenize it. This is easier said than done. The method introduced in Lindau and Venema (2018) gives results for every pair of stations, but needs some human checks to ensure the fits are good. Furthermore, it assumes the break levels behave like noise, while in Lindau and Venema (2019) we found that the break signal in the USA behaves like a random walk. This 2019 method needs a lot of data. Even the results for Germany are already quite noisy, and if you apply it to data-sparse regions you have to select entire continents. Doing so, however, biases the results towards those subregions where there are many stations and would thus give SNR estimates that are too high. So computing the SNR worldwide is not just a blog post, but requires a careful study and likely the development of a new method to estimate the break and noise variance.

Both methods compute the SNR for one difference time series, but in a real case multiple difference time series are used. We will need to study how to do this in an elegant way. How many difference series are used depends on the homogenization method, so this would also make the SNR method-dependent. I would also appreciate having an estimation method that is more universal and can be used to compare networks with each other.

This estimation method should then be applied to global datasets and for various periods to study which regions and periods have a problem. Temperature and pressure are variables that are well correlated from station to station. Much more problematic variables, which should thus be studied as well, are precipitation, wind and humidity. In case of precipitation, there tend to be more stations. This will compensate somewhat, but for the other variables there may even be fewer stations.

We have some ideas how to overcome this problem, from ways to increase the SNR to completely different ways to estimate the influence of inhomogeneities on the data. But they are too preliminary to already blog about. Do subscribe to the blog with any of the options below the tag cloud near the end of the page. ;-)

When we digitize climate data that is currently only available on paper, we tend to prioritize data from regions and periods where we do not have much information yet. However, if after that digitization the SNR would still be low, it may be more worthwhile to digitize data from regions/periods where we already have more data and get that region/period to an SNR above one.

The next post will be about how this low SNR problem changes our estimates of how much the Earth has been warming. Spoiler: the climate “sceptics” will not like that post.

References

Caussinus, Henri and Olivier Mestre, 2004: Detection and correction of artificial shifts in climate series. The Journal of the Royal Statistical Society, Series C (Applied Statistics), 53, pp. 405-425. https://doi.org/10.1111/j.1467-9876.2004.05155.x

Domonkos, Peter and John Coll, 2017: Homogenisation of temperature and precipitation time series with ACMANT3: method description and efficiency tests. International Journal of Climatology, 37, pp. 1910-1921. https://doi.org/10.1002/joc.4822

Lindau, Ralf and Victor Venema, 2018: The joint influence of break and noise variance on the break detection capability in time series homogenization. Advances in Statistical Climatology, Meteorology and Oceanography, 4, pp. 1–18. https://doi.org/10.5194/ascmo-4-1-2018

Lindau, Ralf and Victor Venema, 2019: A new method to study inhomogeneities in climate records: Brownian motion or random deviations? International Journal of Climatology, 39, pp. 4769–4783. Manuscript: https://eartharxiv.org/vjnbd/ Article: https://doi.org/10.1002/joc.6105

Mestre, Olivier, Peter Domonkos, Franck Picard, Ingeborg Auer, Stephane Robin, Émilie Lebarbier, Reinhard Boehm, Enric Aguilar, Jose Guijarro, Gregor Vertachnik, Matija Klancar, Brigitte Dubuisson, Petr Stepanek, 2013: HOMER: a homogenization software - methods and applications. IDOJARAS, Quarterly Journal of the Hungarian Meteorological Society, 117, no. 1, pp. 47–67.

Variable Variability

Monday, 27 April 2020