Monday, 12 October 2020

The deleted chapter of the WMO Guidance on the homogenisation of climate station data

The Task Team on Homogenization (TT-HOM) of the Open Panel of CCl Experts on Climate Monitoring and Assessment (OPACE-2) of the Commission on Climatology (CCl) of the World Meteorological Organization (WMO) has published their Guidance on the homogenisation of climate station data.

The guidance report was a bit longish, so at the end we decided that the last chapter on "Future research & collaboration needs" was best deleted. As chair of the task team and as someone who likes tp dream about what others could do in a comfy chair, I wrote most of this chapter and thus we decided to simply make it a blog post for this blog. Enjoy.


This guidance is based on our current best understanding of inhomogeneities and homogenisation. However, writing it also makes clear there is a need for a better understanding of the problems.

A better mathematical understanding of statistical homogenisation is important because that is what most of our work is based on. A stronger mathematical basis is a prerequisite for future methodological improvements.

A stronger focus on a (physical) understanding of inhomogeneities would complement and strengthen the statistical work. This kind of work is often performed at the station or network level, but also needed at larger spatial scales. Much of this work is performed using parallel measurements, but they are typically not internationally shared.

In an observational science the strength of the outcomes depends on a consilience of evidence. Thus having evidence on inhomogeneities from both statistical homogenisation and physical studies strengthens the science.

This chapter will discuss the needs for future research on homogenisation grouped in five kinds of problems. In the first section we will discuss research on improving our physical understanding and physics-based corrections. The next section is about break detection, especially about two fundamental problems in statistical homogenisation: the inhomogeneous-reference problem and the multiple-breakpoint problem.

Next write about computing uncertainties in trends and long-term variability estimates from homogenised data due to remaining inhomogeneities. It may be possible to improve correction methods by treating it as a statistical model selection problem. The last section discusses whether inhomogeneities are stochastic or deterministic and how that may affect homogenisation and especially correction methods for the variability around the long-term mean.

For all the research ideas mentioned below, it is understood that in future we should study more meteorological variables than temperature. In addition, more studies on inhomogeneities across variables could be helpful to understand the causes of inhomogeneities and increase the signal to noise ratio. Homogenisation by national offices has advantages because here all climate elements from one station are stored together. This helps in understanding and identifying breaks. It would help homogenisation science and climate analysis to have a global database for all climate elements, like iCOADS for marine data. A Copernicus project has started working on this for land station data, which is an encouraging development.

Physical understanding

It is a good scientific practice to perform parallel measurements in order to manage unavoidable changes and to compare the results of statistical homogenisation to the expectations given the cause of the inhomogeneity according to the metadata. This information should also be analysed on continental and global scales to get a better understanding of when historical transitions took place and to guide homogenisation of large-scale (global) datasets. This requires more international sharing of parallel data and standards on the reporting of the size of breaks confirmed by metadata.

The Dutch weather service KNMI published a protocol how to manage possible future changes of the network, who decides what needs to be done in which situation, what kind of studies should be made, where the studies should be published and that the parallel data should be stored in their central database as experimental data. A translation of this report will soon be published by the WMO (Brandsma et al., 2019) and will hopefully inspire other weather services to formalise their network change management.

Next to statistical homogenisation, making and studying parallel measurements, and other physical estimates, can provide a second line of evidence on the magnitude of inhomogeneities. Having multiple lines of evidence provides robustness to observational sciences. Parallel data is especially important for the large historical transitions that are most likely to produce biases in network-wide to global climate datasets. It can validate the results of statistical homogenisation and be used to estimate possibly needed additional adjustments. The Parallel Observations Science Team of the International Surface Temperature Initiative (ISTI-POST) is working on building such a global dataset with parallel measurements.

Parallel data is especially suited to improve our physical understand of the causes of inhomogeneities by studying how the magnitude of the inhomogeneity depends on the weather and on instrumental design characteristics. This understanding is important for more accurate corrections of the distribution, for realistic benchmarking datasets to test our homogenisation methods and to determine which additional parallel experiments are especially useful.

Detailed physical models of the measurement, for example, the flow through the screens, radiative transfer and heat flows, can also help gain a better understanding of the measurement and its error sources. This aids in understanding historical instruments and in designing better future instruments. Physical models will also be paramount for understanding the impact of the surrounding on the measurement — nearby obstacles and surfaces influencing error sources and air flow — to changes in the measurand, such as urbanisation/deforestation or the introduction of irrigation. Land-use changes, especially urbanisation, should be studied together with relocations they may provoke.

Break detection

Longer climate series typically contain more than one break. This so-called multiple-breakpoint problem is currently an important research topic. A complication of relative homogenisation is that also the reference stations can have inhomogeneities. This so-called inhomogeneous-reference problem is not optimally solved yet. It is also not clear what temporal resolution is best for detection and what the optimal way is to handle the seasonal cycle in the statistical properties of climate data and of many inhomogeneities.

For temperature time series about one break per 15 to 20 years is typical and multiple breaks are thus common. Unfortunately, most statistical detection methods have been developed for one break and for the null hypothesis of white (sometimes red) noise. In case of multiple breaks the statistical test should not only take the noise variance into account, but also the break variance from breaks at other positions. For low signal to noise ratios, the additional break variance can lead to spurious detections and inaccuracies in the break position (Lindau and Venema, 2018a).

To apply single-breakpoint tests on series with multiple breaks, one ad-hoc solution is to first split the series at the most significant break (for example, the standard normalised homogeneity test, SNHT) and investigate the subseries. Such a greedy algorithm does not always find the optimal solution. Another solution is to detect breaks on short windows. The window should be short enough to contain only one break, which reduces power of detection considerably. This method is not used much nowadays.

Multiple breakpoint methods can find an optimal solution and are nowadays numerically feasible. This can be done in a hypothesis testing (MASH) or in a statistical model selection framework. For a certain number of breaks these methods find the break combination that minimize the internal variance, that is variance of the homogeneous subperiods, (or you could also state that the break combination maximizes the variance of the breaks). To find the optimal number of breaks, a penalty is added that increases with the number of breaks. Examples of such methods are PRODIGE (Caussinus & Mestre, 2004) or ACMANT (based on PRODIGE; Domonkos, 2011b). In a similar line of research Lu et al. (2010) solved the multiple breakpoint problem using a minimum description length (MDL) based information criterion as penalty function.

This penalty function of PRODIGE was found to be suboptimal (Lindau and Venema, 2013). It was found that the penalty should be a function of the number of breaks, not fixed per break and that the relation with the length of the series should be reversed. It is not clear yet how sensitive homogenisation methods respond to this, but increasing the penalty per break in case of low SNR to reduce the number of breaks does not make the estimated break signal more accurate (Lindau and Venema, 2018a).

Not only the candidate station, also the reference stations will have inhomogeneities, which complicates homogenisation. Such inhomogeneities can be climatologically especially important when they are due to network-wide technological transitions. An example of such a transition is the current replacement of temperature observations using Stevenson screens by automatic weather stations. Such transitions are important periods as they may cause biases in the network and global average trends and they produce many breaks over a short period.

A related problem is that sometimes all stations in a network have a break at the same date, for example, when a weather service changes the time of observation. Nationally such breaks are corrected using metadata. If this change is unknown in global datasets one can still detect and correct such inhomogeneities statistically by comparison with other nearby networks. That would require an algorithm that additionally knows which stations belong to which network and prioritizes correcting breaks found between stations in different networks. Such algorithms do not exist yet and information on which station belongs to which network for which period is typically not internationally shared.

The influence of inhomogeneities in the reference can be reduced by computing composite references over many stations, removing reference stations with breaks and by performing homogenisation iteratively.

A direct approach to solving this problem would be to simultaneously homogenise multiple stations, also called joint detection. A step in this direction are pairwise homogenisation methods where breaks are detected in the pairs. This requires an additional attribution step, which attributes the breaks to a specific station. Currently this is done by hand (for PRODIGE; Caussinus and Mestre, 2004; Rustemeier et al., 2017) or with ad-hoc rules (by the Pairwise homogenisation algorithm of NOAA; Menne and Williams, 2009).

In the homogenisation method HOMER (Mestre et al., 2013) a first attempt is made to homogenise all pairs simultaneously using a joint detection method from bio-statistics. Feedback from first users suggests that this method should not be used automatically. It should be studied how good this methods works and where the problems come from.

Multiple breakpoint methods are more accurate as single breakpoint methods. This expected higher accuracy is founded on theory (Hawkins, 1972). In addition, in the HOME benchmarking study it was numerically found that modern homogenisation methods, which take the multiple breakpoint and the inhomogeneous reference problems into account, are about a factor two more accurate as traditional methods (Venema et al., 2012).

However, the current version of CLIMATOL applies single-breakpoint detection tests, first SNHT detection on a window then splitting, to achieve results comparable to modern multiple-breakpoint methods with respect to break detection and homogeneity of the data (Killick, 2016). This suggests that the multiple-breakpoint detection principle may not be as important as previously thought and warrants deeper study or the accuracy of CLIMATOL is partly due to an unknown unknown.

The signal to noise ratio is paramount for the reliable detection of breaks. It would thus be valuable to develop statistical methods that explain part of the variance of a difference time series and remove this to see breaks more clearly. Data from (regional) reanalysis could be useful predictors for this.

First methods have been published to detect breaks for daily data (Toreti et al., 2012; Rienzner and Gandolfi, 2013). It has not been studied yet what the optimal resolution for breaks detection is (daily, monthly, annual), nor what the optimal way is to handle the seasonal cycle in the climate data and exploit the seasonal cycle of inhomogeneities. In the daily temperature benchmarking study of Killick (2016) most non-specialised detection methods performed better than the daily detection method MAC-D (Rienzner and Gandolfi, 2013).

The selection of appropriate reference stations is a necessary step for accurate detection and correction. Many different methods and metrics are used for the station selection, but studies on the optimal method are missing. The knowledge of local climatologists which stations have a similar regional climate needs to be made objective so that it can be applied automatically (at larger scales).

For detection a high signal to noise ratio is most important, while for correction it is paramount that all stations are in the same climatic region. Typically the same networks are used for both detection and correction, but it should be investigated whether a smaller network for correction would be beneficial. Also in general, we need more research on understanding the performance of (monthly and daily) correction methods.

Computing uncertainties

  • Also after homogenisation uncertainties remain in the data due to various problems: Not all breaks in the candidate station have been and can be detected.

  • False alarms are an unavoidable trade-off for detecting many real breaks.

  • Uncertainty in the estimation of correction parameters due to limited data.

  • Uncertainties in the corrections due to limited information on the break positions.

From validation and benchmarking studies we have a reasonable idea about the remaining uncertainties that one can expect in the homogenised data, at least with respect to changes in the long-term mean temperature. For many other variables and changes in the distribution of (sub-)daily temperature data individual developers have validated their methods, but systematic validation and comparison studies are still missing.

Furthermore, such studies only provide a general uncertainty level, whereas more detailed information for every single station/region and period would be valuable. The uncertainties will strongly depend on the signal to noise ratios, on the statistical properties of the inhomogeneities of the raw data and on the quality and cross-correlations of the reference stations. All of which vary strongly per station, region and period.

Communicating such a complicated errors structure, which is mainly temporal, but also partially spatial, is a problem in itself. Furthermore, not only the uncertainty in the means should be considered, but, especially for daily data, uncertainties in the complete probability density function need to be estimated and communicated. This could be communicated with an ensemble of possible realisations, similar to Brohan et al. (2006).

An analytic understanding of the uncertainties is important, but is often limited to idealised cases. Thus also numerical validation studies, such as the past HOME and upcoming ISTI studies are important for an assessment of homogenisation algorithms under realistic conditions.

Creating validation datasets also help to see the limits of our understanding of the statistical properties of the break signal. This is especially the case for variables other than temperature and for daily and (sub-)daily data. Information is needed on the real break frequencies and size distributions, but also their auto-correlations and cross-correlations, as well as explained in the next section the stochastic nature of breaks in the variability around the mean.

Validation studies focussed on difficult cases would be valuable for a better understanding. For example, sparse networks, isolated island networks, large spatial trend gradients and strong decadal variability in the difference series of nearby stations (for example, due to El Nino in complex mountainous regions).

The advantage of simulated data is that it can create a large number of quite realistic complete networks. For daily data it will remain hard for the years to come to determine how to generate a realistic validation dataset. Thus even if using parallel measurements is mostly limited to one break per test, it does provide the highest degree of realism for this one break.

Deterministic or stochastic corrections?

Annual and monthly data is normally used to study trends and variability in the mean state of the atmosphere. Consequently, typically only the mean is adjusted by homogenisation. Daily data, on the other hand is used to study climatic changes in weather variability, severe weather and extremes. Consequently, not only the mean should be corrected, but the full probability distribution describing the variability of the weather.

The physics of the problem suggests that many inhomogeneities are caused by stochastic processes. An example affecting many instruments are differences in the response time of instruments, which can lead to differences determined by turbulence. A fast thermometer will on average read higher maximum temperatures than a slow one, but this difference will be variable and sometimes be much higher than the average. In case of errors due to insolation the radiation error will be modulated by clouds. An insufficiently shielded thermometer will need larger corrections on warm days, which will typically be more sunny, but some warm days will be cloudy and not need much correction, while other warm days are sunny and calm and have a dry hot surface. The adjustment of daily data for studies on changes in the variability is thus a distribution problem and not only a regression bias-correction problem. For data assimilation (numerical weather prediction) accurate bias correction (with regression methods) is probably the main concern.

Seen as a variability problem, the correction of daily data is similar to statistical downscaling in many ways. Both methodologies aim to produce bias-corrected data with the right variability, taking into account the local climate and large-scale circulation. One lesson from statistical downscaling is that increasing the variance of a time series deterministically by multiplication with a fraction, called inflation, is the wrong approach and that the variance that could not be explained by regression using predictors should be added stochastically as noise instead (Von Storch, 1999). Maraun (2013) demonstrated that the inflation problem also exists for the deterministic Quantile Matching method, which is also used in daily homogenisation. Current statistical correction methods deterministically change the daily temperature distribution and do not stochastically add noise.

Transferring ideas from downscaling to daily homogenisation is likely fruitful to develop such stochastic variability correction methods. For example, predictor selection methods from downscaling could be useful. Both fields require powerful and robust (time invariant) predictors. Multi-site statistical downscaling techniques aim at reproducing the auto- and cross-correlations between stations (Maraun et al., 2010), which may be interesting for homogenisation as well.

The daily temperature benchmarking study of Rachel Killick (2016) suggests that current daily correction methods are not able to improve the distribution much. There is a pressing need for more research on this topic. However, these methods likely also performed less well because they were used together with detection methods with a much lower hit rate than the comparison methods.

The deterministic correction methods may not lead to severe errors in homogenisation, that should still be studied, but stochastic methods that implement the corrections by adding noise would at least theoretically fit better to the problem. Such stochastic corrections are not trivial and should have the right variability on all temporal and spatial scales.

It should be studied whether it may be better to only detect the dates of break inhomogeneities and perform the analysis on the homogeneous subperiods (removing the need for corrections). The disadvantage of this approach is that most of the trend variance is in the difference in the mean of the HSPs and only a small part is in the trend within the HPSs. In case of trend analysis, this would be similar to the work of the Berkeley Earth Surface Temperature group on the mean temperature signal. Periods with gradual inhomogeneities, e.g., due to urbanisation, would have to be detected and excluded from such an analysis.

An outstanding problem is that current variability correction methods have only been developed for break inhomogeneities, methods for gradual ones are still missing. In homogenisation of the mean of annual and monthly data, gradual inhomogeneities are successfully removed by implementing multiple small breaks in the same direction. However, as daily data is used to study changes in the distribution, this may not be appropriate for daily data as it could produce larger deviations near the breaks. Furthermore, changing the variance in data with a trend can be problematic (Von Storch, 1999).

At the moment most daily correction methods correct the breaks one after another. In monthly homogenisation it is found that correcting all breaks simultaneously (Caussinus and Mestre, 2004) is more accurate (Domonkos et al., 2013). It is thus likely worthwhile to develop multiple breakpoint correction methods for daily data as well.

Finally, current daily correction methods rely on previously detected breaks and assume that the homogeneous subperiods (HSP) are homogeneous (i.e., each segment between breakpoints assume to be homogeneous) . However, these HSP are currently based on detection of breaks in the mean only. Breaks in higher moments may thus still be present in the "homogeneous" sub periods and affect the corrections. If only for this reason, we should also work on detection of breaks in the distribution.

Correction as model selection problem

The number of degrees of freedom (DOF) of the various correction methods varies widely. From just one degree of freedom for annual corrections of the means, to 12 degrees of freedom for monthly correction of the means, to 40 for decile corrections applied to every season, to a large number of DOF for quantile or percentile matching.

A study using PRODIGE on the HOME benchmark suggested that for typical European networks monthly adjustment are best for temperature; annual corrections are probably less accurate because they fail to account for changes in seasonal cycle due to inhomogeneities. For precipitation annual corrections were most accurate; monthly corrections were likely less accurate because the data was too noisy to estimate the 12 correction constants/degrees of freedom.

What is the best correction method depends on the characteristics of the inhomogeneity. For a calibration problem just the annual mean could be sufficient, for a serious exposure problem (e.g., insolation of the instrument) a seasonal cycle in the monthly corrections may be expected and the full distribution of the daily temperatures may need to be adjusted. The best correction method also depends on the reference. Whether the variables of a certain correction model can be reliably estimated depends on how well-correlated the neighbouring reference stations are.

An entire regional network is typically homogenised with the same correction method, while the optimal correction method will depend on the characteristics of each individual break and on the quality of the reference. These will vary from station to station, from break to break and from period to period. Work on correction methods that objectively select the optimal correction method, e.g., using an information criterion, would be valuable.

In case of (sub-)daily data, the options to select from become even larger. Daily data can be corrected just for inhomogeneities in the mean (e.g., Vincent et al., 2002, where daily temperatures are corrected by incorporating a linear interpolation scheme that preserves the previously defined monthly corrections) or also for the variability around the mean. In between are methods that adjust for the distribution including the seasonal cycle, which dominates the variability and is thus effectively similar to mean adjustments with a seasonal cycle. Correction methods of intermediate complexity with more than one, but less than 10 degrees of freedom would fill a gap and allow for more flexibility in selecting the optimal correction model.

When applying these methods (Della-Marta and Wanner, 2006; Wang et al., 2010; Mestre et al., 2011; Trewin, 2013) the number of quantile bins (categories) needs to be selected as well as whether to use physical weather-dependent predictors and the functional form they are used (Auchmann and Brönnimann, 2012). Objective optimal methods for these selections would be valuable.

Related information

WMO Guidelines on Homogenization (English, French, Spanish) 

WMO guidance report: Challenges in the Transition from Conventional to Automatic Meteorological Observing Networks for Long-term Climate Records

Sunday, 30 August 2020

A primer on herd immunity for Social Darwinists

Herd immunity has been proposed as a way to deal with the new Corona virus. In the best case, it is a call to slow down the spread of the SARS-CoV-2 virus enough so that the hospitals can just handle the flood of patients, in the worst case it is a fancy term for just doing nothing and let everyone get sick. 

Trump often talked about letting the pandemic wash over America. Insider reports confirm he was indeed talking about herd immunity. When I first heard claims that Boris Johnson was pursuing herd immunity, I assumed his political opponents were smearing him and trying to get him to act, but it seems as if this was really his plan. Of all world leaders Jair Bolsonaro may be most in denial about the pandemic, which he calls a little flu. He also advocated herd immunity. All these leaders have downplayed the threat, which by itself helps spread the decease, and advocated policies that promote infection, leading to more infected, sick and dead people.

America and Brazil lead the COVID-19 death rankings unchallenged with respectively 187 and 120 thousand total deaths and around one thousand people dying every day the last month. The UK is the country with the most COVID-19 deaths in Europe, while it was lucky to get it late.

This "strategy" has a certain popularity among Trump-like politicians. I do not think they know what they are doing. Scientific advice tends to come from a humanist perspective where every life is valued. Such advice is naturally rejected by Social Darwinians, who in the best case do not care about most people. While these politicians naturally see themselves as more valuable than us, they tend not to excel in academics. So let me explain why herd immunity is also a bad policy from their perspective, even if up to now people from groups they hate had a higher risk of dying.

Herd immunity

If we do not take any preventative measures one SARS-CoV-2 infected person infects two or three further people. This may not sound like much, but this is an example of exponential growth. It is the same situation of the craftsman who "only" asks the king for rice as payment for his chessboard: one grain on the first square, two on the second, four on the third square and so on.

If we assume that one person only infects two other people, that is that the base reproduction number is two, then the sequence is: 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, ...
536,870,912, ...

Those are just 30 steps to get to half a billion and the steps take about 5 days in case of SARS-CoV-2. So that is half a year. With good care 1 to 2 percent of people die, that would be at this point 5 to 10 million people. Do be highly optimistic and it is still 2 million people.

Much more people need to got the hospital. In Germany this is 17%. A recent French study reported that after 110 day most patients are still tired and have trouble breathing, many did not yet work again. That would be around 200 million people with long-lasting health problems.

This will naturally not happen in reality. People will take action to reduce the reproduction number, whether the government mandates it or not. And at a certain moment an infected person will not infect as many people because many are already immune. If the base reproduction number is two and half the population is immune, the infected person will only infect one other person, that is the effective reproduction number is one.

The actual base reproduction number is most likely larger than two and reality is more complicated, so experts estimates that the actual herd immunity level is not 50%, but between 60 and 70%. More complication is that it is possible that people are sufficiently immune to avoid getting ill again, but the immunity may not prevent people from getting infected and transmitting the virus. There is a strong case of a 33 year old man from Hong Kong, who got infected twice, but did not get ill. If this were typical, herd immunity would not exist.

You may have heard experts say that once this immunity level has been reached, that the pandemic is over. But this does not mean that the virus is gone. Europe needed several months of an effective reproduction number well below one to get to low infection numbers (and the virus is still not gone). This was after a drastic decrease in the effective reproduction number (R) due to public health measures, in case of herd immunity it would initially be around one, and only very slowly go below one.

Say that when we reach R = 1 when one million people are infected, the after one step later (5 days) another one million people are infected. One million of 30% of the world population is not much. So also R will still be almost one. In other words, it would take several years for the virus to go away even in the best case. In the worst case, the virus mutates, people lose some immunity and new babies are immunologically naive. So most likely SARS-CoV-2 would stay with us forever.

Reaching herd immunity will not help Trump. He will still be bunkered down in the White House surrounded by staff that is tested every day so as not to infect him, while the calls on others do go out, without a good testing system, and die for him. Trump's billionaire buddies will still need to lock themselves up in their mansions or high-sea yachts, counting how much richer they got from Trump's COVID-19 bailout. The millionaire hosts at Fox News will stay at home telling others to go out to work even if their work place is not safe and to accelerate the pandemic by sending kids to schools even it the schools are not safe. They will still need to wait until there is a vaccine. The herd immunity strategy only ensures that up to that time the largest number of people have died.

When the virus is everywhere, good luck trying to keep it out of elderly homes. In 2016 Trump won in the age groups above 50. The UK Conservatives had a 47 point lead among those aged 65+. In Brazil Bolsonaro had a 16 percent point lead for people older than 60. They will be the ones dying and seeing their friends die. This is not helpful for the popular support of far right politicians.

The elite may think that it will be the poorest 70% that get infected. Far right Republicans may hope that it will affect Democrats and people of color more. It is true that at the moment poor people are more affected as they cannot afford staying at home even if their place of work is not safe. It is true that initially mostly blue states and cities were affected in America.

Let's take the theoretical case where the poorest 70% are infected or immune and the richest 30% still immunologically naive. As soon as one of these 30% are infected, it will spread like wild fire as rich people tend to hang out with rich people, so the virus would easily find two or three rich people to infect next.

That is one reason why it is too simple to equate a base reproduction number with a herd immunity of 50%. This would be the case if the population were perfectly mixed. But any network were the immunity level is not yet 50% is up for grabs. In the end everyone will get it, rich or poor, red or blue.

The only Social Darwinists for whom this pays are billionaires who have their own private hospital, with their own nurses, doctors at their mansion. They would have a chance of 1 to 2 percent to die. While if they manage to convince the people to go for herd immunity and not even to stay below the carrying capacity of the hospitals around 5% of the population would die. That is a 2 to 3% survival difference. Not sure that is worth getting all your politicians kicked out of office.

It naturally also helps the high frequency traders. Like the Mercer family who funded Trump in 2016 when no one thought he was a good investment. They have made so much money from the chaos Trump produces. Up or down the high frequency trader wins. Down goes faster. They live their lives on chaos, suffering and destruction. I presume they have a private hospital, they have the money.

But for the average Joe Social Darwinist there are nearly no gains and it is bad politics. It hurts your country compared to more social democratic countries and at home it helps lefties get into power and implement disgusting policies that help everyone.

Related reading

Sunday, 26 July 2020

Micro-blogging for scientists without nasties and surveillance

Start screen picture of Mastodon: A Mastodon playing with paper airplanes.

Two years ago I joined Mastodon to get to know a more diverse group of people here in Bonn. Almost two thousand messages later, I can say I really like it there and social networks like Mastodon are much more healthy for society as well. Together with Frank Sonntag we have recently set up a Mastodon server for publishing scientists. Let me explain how it works and why this system is better for the users and society.

Mastodon looks a lot like Twitter, i.e. it is a micro-blogging system, but many tweaks make it a much friendlier place where you can have meaningful conversations. One exemplary difference is that there are no quote tweets. Quoting rather than simply replying is often used by large accounts to bully small ones by pulling in many people into the "conversation" who disagree. I do miss quote tweets, they can also be used for good, to highlight what is interesting about a tweet or to explain something that the writer assumed their readers know, but your readers may not know. But quote tweets make the atmosphere more adversarial, less about understanding and talking with each other. Conflict leads to more engagement and more time on the social network, so Twitter and Facebook like it, but pitting groups against each other is not the public debate that makes humanity better.

The main difference under the hood is that the system is not controlled by one corporation. There is not one server, but many servers that seamlessly talk with each other, just like the email system. The communication protocol (ActivityPub) is a standard of the World Wide Web Consortium, just like HTML and HTTPS, which powers the web.

This means that you can chose the server and interface you like and still talk to others, while people on Twitter, Facebook, Instagram, WordPress and Tumblr can only talk to other people in their silo. As they say the modern internet is a group of five websites, each consisting of screenshots from the other four. It is hard to leave these silos, it would cut you off from your friends. This is also why the system naturally evolves into a few major players. Their service is as bad as one would expect with the monopoly power this network effect gives them.

The Fediverse and its soial networks as icons

ActivityPub is not only used by Mastodon, but also by other micro-blogging social networks such as Pleroma, blogging networks such as, podcasting services such as FunkWhale and file hosting such as NextCloud. There is a version of Instagram (PixelFed) and of YouTube (PeerTube). With ActivityPub all these social networks can talk to each other. Where they do different things, the system is designed to degrade gracefully. FixelFed shows photos more beautifully, has collections and filters, but Mastodon gracefully shows the recent photos as a photo below a message. PeerTube shows one large video on a page, just like Twitter, Mastodon shows the newest videos in small below a message in the news feed. The full network is called the fediverse, a portmanteau of federation and universe.

Currently all these services are ad-free and tracking-free. The coding of the open source software is largely a labor of love, even if some coders are supported by micro-funding, for example Patreon or Liberapay. Most servers are maintained by people as hobby, some (like for email) by organization for their members, some larger ones again use Patreon or Liberapay, some are even coops.

This means that technology enthusiasts from the middle class are mostly behind these networks. That is better than a few large ad corporations, but still not as democratic as one would like for such an important part of our society.


Not only can these networks talk to each other, they also themselves consist of many different servers each maintained by another group, just as the email system. This means that moderation of the content is much better than on Twitter or Facebook. The owners of the servers want to create a functional community, while these communities are relatively small. So they can invest much more time per moderation decision than a commercial silo would. Also if the moderation fails, people will go somewhere else.

Individual moderation decisions only pertain one server and are thus less impactful and can consequently be more forceful. If you do not like the moderation, you can move to another server that fits your values better. If you are kicked off a server, you can go to another one and still talk to your friends. Facebook kicking someone off Facebook or Twitter kicking someone off Twitter is somewhat of a big deal and is thus only done in extreme cases, when someone already created a lot of damage to the social fabric, while others make the atmosphere toxic staying below the radar.

If someone is really annoying they may naturally be removed from many servers. Then it does become a problem for this person, but that only happens when many server administrator agree you are not welcome. So maybe that person is really not an enrichment for humanity.

The extreme example would be Nazis. Some Nazis were too extreme for Twitter and started their own micro-blogging network. Probably most Nazis know the name already, but I think it is a good policy not to help bad actors with PR. As this network was used to coordinate their violent and inhumane actions, Google and Apple have removed their apps from their app stores. I may like that outcome, but these corporations should not have that power. Next this network started using ActivityPub, so that they can use ActivityPub apps. The main Activity network does not like Nazis, so they all blocked this network.

I feel this is a good solution for society, everyone has their freedom of speech, but Nazis cannot harass decent people. They can tell each other pretty lies, where being responsible for killing more than 138 thousand Americans is patriotism, but 4 is treason, where the state brutalizing people expressing their 1st amendment rights is freedom, but wearing a mask not to risk the lives of others is tyranny. At least we do not have to listen to the insanity. (The police should naturally listen to stop crime.)

Many of the societal problems of Facebook and Co. would be much reduced if we would legislate that such large networks open up to competition by implementing open communication protocols like ActivityPub. Then they would be forced to deliver a good product to keep their customers. If they do not change many will flee the repulsive violent conspiracy surveilance hell they were only still part of to be able to talk to grandma.

Because there are nearly no Nazis and other unfriendly characters, the fediverse is very popular with groups they would otherwise harass and bully into silence. It is a colorful bunch. This illustrates that extending the right to free speech to the right to be amplified by others does not optimize the freedom of speech, but in reality excludes many voices.

A short encore: the coders of the ActivityPub apps also do not like Nazis. So they hard coded Nazi blocks into their apps. It is open source software, so the Nazis can remove this, but Google and Apple will not accept their apps. The latter is the societal problem, the coders are perfectly in their right not to want their work be used to destroy civilization.

Open Science

The fediverse looks a lot like the Open Science tool universe I am dreaming of. Many independent groups and servers that seamlessly communicate with each other. The Grassroots post-publication peer review system I am working on should be able to gather reviews from all the other review and endorsement systems. They and repositories should be able to display grassroots reviews.

The reviews could be aided by displaying information on retractions from the Retraction Watch database. I hope someone will build a service that also warns when a cited article is retracted. The review could show or link to open citations of the article and statistics checks, as well as plagiarism and figure tampering checks.

We could have systems that warn authors of new articles and manuscripts they may find interesting given their publication history and warn editors of manuscripts that fit to their journal. I recently made a longer list of useful integrations and services and put it on Zenodo.

These could all be independent services that work together via ActivityPub and APIs, but the legacy publishers are working on collaborative science pipelines that create network effects, to ensure you are forced to use the largest service where you colleagues are and cannot leave, just like Facebook, Google and Twitter.


A mastodon with a paperplane in its trunk.
I am explaining all this to illustrate that such a federated social network is much better for society and its users. I really like the atmosphere on Mastodon. You can have real conversations with interesting people, without lunatics jumping in between or groups being pitted against each other. If people hear less and less of me on Twitter, that is one of the reasons.

So I hope that this kind of network is the future and to help getting there we have started a Mastodon server for publishing scientists. "We" is me and former meteorologist Frank Sonntag who leads a small digital services company, AKM-services. So for him setting up a Mastodon server was easy.

Two years ago he had to drag me to Mastodon a bit, when we tried to set up a server just for the Earth Sciences. That did not work out. By now that I have learned to love Mastodon, it has gotten a lot bigger and more people are aware of the societal problems due to social media. So it is time for another try with a larger target audience, all scientists. We have called it: FediScience.

Mastodon is still quite small with about half a million active users; Twitter is 100 times bigger. My impression is that at least many climate scientists are on Twitter for science communication. For many leaving Twitter is not yet a realistic option, but FediScience could be a friendly place to talk to colleagues, nerd out about detailed science, while staying on Twitter for more comprehensible Tweets on the main findings.

Once we have a nice group together, say after the boreal summer holidays, we can together decide on the local rules. How we would like to moderate, who will do the moderation, with whom our server federates, who is welcome, how long the messages are, whether we want equations, ...

My network empire

My solution to Mastodon still being small was to stay on Twitter to talk about climate science, the political problems leading to the climate branch of the American culture war and anything that comes up on this blog: Variable Variability. As the goal of my Mastodon account in Bonn is to build a local network for a digital non-profit, there I talk about the open web, data privacy more, often write in German and only occasionally write about climate. I aim to use my new account at FediScience to talk about (open) science and to enjoy finally a captive audience that understands the statistics of variability. As administrator I will try to help people find their way in the fediverse.

Next to this the grassroots open review journals are on Mastodon, Twitter and Reddit. And I have inherited the Open Science Feed from Jon Tennant, which is on Mastodon, Twitter and Reddit. Both deserve to get an IndieWeb homepage and a newsletter, but all newsletters I know are full of trackers, suggestions for ethical ones are welcome. For even more fun, I also created a Twitter feed for the climate statistics blog Tamino and scientific skeptic Potholer54's YouTube channel. I should probably put them on Mastodon as well. That makes this blog my 12th social media channel. Pro-tip: with Firefox "containers" you can be logged in into multiple Mastodon, Twitter or Reddit accounts.

Every member of FediScience can invite their colleagues to join the network. Please do. This is my invitation link for you. (If you share the link in public, please make it time limited.)

Please let other scientists know about FediScience, whether by mail or via one of the social media silos. These are good Tweets to spread:


When you join Mastodon, the following glossary is helpful.

The Bird Site Twitter
Fediverse All federated social media sites together
Instance Sever running Mastodon
Toot Tweet
Boost Retweet
ActivityPub (AP) The main communication protocol in the fediverse
Content Warning (CW)A convenient way to give a heads up A mirror site of Twitter without tracking, popular for linking to in Mastodon


Friday, 29 May 2020

What does statistical homogenization tell us about the underestimated global warming over land?

Climate station data contains inhomogeneities, which are detected and corrected by comparing a candidate station to its neighbouring reference stations. The most important inhomogeneities are the ones that lead to errors in the station network-wide trends and in global trend estimates. 

An earlier post in this series argued that statistical homogenization will tend to under-correct errors in the network-wide trends in the raw data. Simply put: that some of the trend error will remain. The catalyst for this series is the new finding that when the signal to noise ratio is too low, homogenization methods will have large errors in the positions of the jumps/breaks. For much of the earlier data and for networks in poorer countries this probably means that any trend errors will be seriously under-corrected, if they are corrected at all.

The questions for this post are: 1) What do the corrections in global temperature datasets do to the global trend and 2) What can we learn from these adjustments for global warming estimates?

The global warming trend estimate

In the global temperature station datasets statistical homogenization leads to larger warming estimates. So as we tend to underestimate how much correction is needed, this suggests that the Earth warmed up more than current estimates indicate.

Below is the warming estimate in NOAA’s Global Historical Climate Network (Versions 3 and 4) from Menne et al. (2018). You see the warming in the “raw data” (before homogenization; striped lines) and in the homogenized data (drawn line). The new version 4 is drawn in black, the previous version 3 in red. For both versions homogenization makes the estimated warming larger.

After homogenization the warming estimates of the two versions are quite similar. The difference is in the raw data. Version 4 is based on the raw data of the International Surface Temperature Initiative and has much more stations. Version 3 had many stations that report automatically, these are typically professional stations and a considerable part of them are at airports. One reason the raw data may show less warming in Version 3 is that many stations at airports were in cities before. Taking them out of the urban heat island and often also improving the local siting of the station, may have produced a systematic artificial cooling in the raw observations.

Version 4 has more stations and thus a higher signal to noise ratio. One may thus expect it to show more warming. That this is not the case is a first hint that the situation is not that simple, as explained at the end of this post.

Figure from Menne et al. with warming estimates from 1880. See caption below.
The global land warming estimates based on the Global Historical Climate Network dataset of NOAA. The red lines are for version 3, the black lines for the new version 4. The striped lines are before homogenization and the drawn lines after homogenization. Figure from Menne et al. (2018).

The difference due to homogenization in the global warming estimates is shown in the figure below, also from Menne et al. (2018). The study also added an estimate for the data of the Berkeley Earth initiative.

(Background information. Berkeley Earth started as a US Culture War initiative where non-climatologists computed the observed global warming. Before the results were in, climate “sceptics” claimed their methods were the best and they would accept any outcome. The moment the results turned out to be scientifically correct, but not politically correct, the climate “sceptics” dropped them like a hot potato.)

We can read from the figure that in GHCNv3 over the full period homogenization increases warming estimates by about 0.3 °C per century, while this is 0.2°C in GHCNv4 and 0.1°C in the dataset of Berkeley Earth datasets. GHCNv3 has more than 7000 stations (Lawrimore et al., 2011). GHCNv4 is based on the ISTI dataset (Thorne et al., 2011), which has about 32,000 stations, but GHCN only uses those of at least 10 years and thus contains about 26,000 stations (Menne et al. 2018). Berkeley Earth is based on 35,000 stations (Rohde et al., 2013).

Figure from Menne et al. (2018) showing how much adjustments were made.
The difference due to homogenization in the global warming estimates (Menne et al., 2018). The red line is for smaller GHCNv3 dataset, the black line for GHCNv4 and the blue line for Berkeley Earth.

What does this mean for global warming estimates?

So, what can we learn from these adjustments for global warming estimates? At the moment, I am afraid, not yet a whole lot. However, the sign is quite likely right. If we could do a perfect homogenization, I expect that this would make the warming estimates larger. But to estimate how large the correction should have been based on the corrections which were actually made in the above datasets is difficult.

In the beginning, I was thinking: if the signal to noise ratio in some network is too low, we may be able to estimate that in such a case we under-correct, say, 50% and then make the adjustments unbiased by making them, say, twice as large.

However, especially doing this globally is a huge leap of faith.

The first assumption this would make is that the trend bias in data sparse regions and periods is the same as that of data rich regions and periods. However, the regions with high station density are in the [[mid-latitudes]] where atmospheric measurements are relatively easy. The data sparse periods are also the periods in which large changes in the instrumentation were made as we were still learning how to make good meteorological observations. So we cannot reliably extrapolate from data rich regions and periods to data sparse regions and periods. 

Furthermore, there will not be one correction factor to account for under-correction because the signal to noise ratio is different everywhere. Maybe America is only under-corrected by 10% and needs just a little nudge to make the trend correction unbiased. However, homogenization adjustments in data sparse regions may only be able to correct such a small part of the trend bias that correcting for the under-correction becomes adventurous or even will make trend estimates more uncertain. So we would at least need to make such computations for many regions and periods.

Finally, another reason not to take such an estimate too seriously are the spatial and temporal characteristics of the bias. The signal to noise ratio is not the only problem. One would expect that it also matters how the network-wide trend bias is distributed over the network. In case of relocations of city stations to airports, a small number of stations will have a large jump. Such a large jump is relatively easy to detect, especially as its neighbouring stations will mostly be unaffected.

Already a harder case is the time of observation bias in America, where a large part of the stations has experienced a cooling shift from afternoon to morning measurements over many decades. Here, in most cases the neighbouring stations were not affected around the same time, but the smaller shift makes it harder to detect these breaks.

(NOAA has a special correction for this problem, but when it is turned off statistical homogenization still finds the same network-wide trend. So for this kind of bias the network density in America is apparently sufficient.)

Among the hardest case are changes in the instrumentation. For example, the introduction of Automatic Weather Stations in the last decades or the introduction of the Stevenson screen a century ago. These relatively small breaks often happen over a period of only a few decades, if not years, which means that also the neighbouring stations are affected. That makes it hard to detect them in a difference time series.

Studying from the data how the biases are distributed is hard. One could study this by homogenizing the data and studying the breaks, but the ones which are difficult to detect will then be under-represented. This is a tough problem; please leave suggestions in the comments.

Because of how the biases are distributed it is perfectly possible that the trend biases corrected in GHCN and Berkley Earth are due to the easy-to-correct problems, such as the relocations to airports, while the hard ones, such as the transition to Stevenson screens, are hardly corrected. In this case, the correction that could be made, do not provide information on the ones that could not be made. They have different causes and different difficulties.

So if we had a network where the signal to noise ratio is around one, we could not say that the under-correction is, say, 50%. One would have to specify for which kind of distribution of the bias this is valid.

GHCNv3, GHCNv4 and Berkeley Earth

Coming back to the trend estimates of GHCN version 3 and version 4. One may have expected that version 4 is able to better correct trend biases, having more stations, and should thus show a larger trend than version 3. This would go even more so for Berkeley Earth. But the final trend estimates are quite similar. Similarly in the most data rich period after the second world war, the least corrections are made.

The datasets with the largest number of stations showing the strongest trend would have been a reasonable expectation if the trend estimates of the raw data would have been similar. But these raw data trends are the reason for the differences in the size of the corrections, while the trend estimates based on the homogenized are quite similar.

Many additional stations will be in regions and periods where we already had many stations and where the station density was no problem. On the other hand, adding some stations to data sparse regions may not be sufficient to fix the low signal to noise ratio. So the most improvements would be expected for the moderate cases where the signal to noise ratio is around one. Until we have global estimates of the signal to noise ratio for these datasets, we do not know for which percentage of stations this is relevant, but this could be relatively small.

The arguments of the previous section are also applicable here; the relationship between station density and adjustments may not be that easy. Especially that the corrections in the period after the second world war are so small is suspicious; we know quite a lot happened to the measurement networks. Maybe these effects all average out, but that would be quite a coincidence. Another possibility is that these changes in observational methods were made over relatively short periods to entire networks making it hard to correct them.

A reason for the similar outcomes for the homogenized data could be that all datasets successfully correct for trend biases due to problems like the transition to airports, while for every dataset the signal to noise ratio is not enough to correct problems like the transition to Stevenson screens. GHNCv4 and Berkeley Earth using as many stations as they could find could well have more stations which are currently badly sited than GHCNv3, which was more selective. In that case the smaller effective corrections of these two datasets would be due to compensating errors.

Finally, as small disclaimer: The main change from version 3 to 4 was the number of stations, but there were other small changes, so it is not just a comparison of two datasets where only the signal to noise ratio is different. Such a pure comparison still needs to be made. The homogenization methods of GHCN and Berkeley Earth are even more different.

My apologies for all the maybe's and could be's, but this is something that is more complicated than it may look and I would not be surprised if it will turn out to be impossible to estimate how much corrections are needed based on the corrections that are made by homogenization algorithms. The only thing I am confident about is that homogenization improves trend estimates, but I am not confident about how much it improves.

Parallel measurements

Another way to study these biases in the warming estimates is to go into the books and study station histories in 200 plus countries. This is basically how sea surface temperature records are homogenized. To do this for land stations is a much larger project due to the large number of countries and languages.

Still there are such experiments, which give a first estimate for some of the biases when it comes to the global mean temperature (do not expect regional detail). In the next post I will try to estimate the missing warming this way. We do not have much data from such experiments yet, but I expect that this will be the future.

Other posts in this series


Chimani, Barbara, Victor Venema, Annermarie Lexer, Konrad Andre, Ingeborg Auer and Johanna Nemec, 2018: Inter-comparison of methods to homogenize daily relative humidity. International Journal Climatology, 38, pp. 3106–3122.

Gubler, Stefanie, Stefan Hunziker, Michael Begert, Mischa Croci-Maspoli, Thomas Konzelmann, Stefan Brönnimann, Cornelia Schwierz, Clara Oria and Gabriela Rosas, 2017: The influence of station density on climate data homogenization. International Journal of Climatology, 37, pp. 4670–4683.

Lawrimore, Jay H., Matthew J. Menne, Byron E. Gleason, Claude N. Williams, David B. Wuertz, Russel S. Vose and Jared Rennie, 2011: An overview of the Global Historical Climatology Network monthly mean temperature data set, version 3. Journal of Geophysical Research, 116, D19121.

Lindau, Ralf and Victor Venema, 2018: On the reduction of trend errors by the ANOVA joint correction scheme used in homogenization of climate station records. International Journal of Climatology, 38, pp. 5255– 5271. Manuscript: Article:

Rohde, Robert, Richard A. Muller, Robert Jacobsen, Elizabeth Muller, Saul Perlmutter, Arthur Rosenfeld, Jonathan Wurtele, Donald Groom and Charlotte Wickham, 2013: A New Estimate of the Average Earth Surface Land Temperature Spanning 1753 to 2011. Geoinformatics & Geostatistics: An Overview, 1, no.1.

Sutton, Rowan, Buwen Dong and Jonathan Gregory, 2007: Land/sea warming ratio in response to climate change: IPCC AR4 model results and comparison with observations. Geophysical Research Letters, 34, L02701.

Thorne, Peter W., Kate M. Willett, Rob J. Allan, Stephan Bojinski, John R. Christy, Nigel Fox, Simon Gilbert, Ian Jolliffe, John J. Kennedy, Elizabeth Kent, Albert Klein Tank, Jay Lawrimore, David E. Parker, Nick Rayner, Adrian Simmons, Lianchun Song, Peter A. Stott and Blair Trewin, 2011: Guiding the creation of a comprehensive surface temperature resource for twenty-first century climate science. Bulletin American Meteorological Society, 92, ES40–ES47.

Wallace, Craig and Manoj Joshi, 2018: Comparison of land–ocean warming ratios in updated observed records and CMIP5 climate models. Environmental Research Letters, 13, no. 114011. 

Williams, Claude, Matthew Menne and Peter Thorne, 2012: Benchmarking the performance of pairwise homogenization of surface temperatures in the United States. Journal Geophysical Research, 117, D05116.

Friday, 1 May 2020

Statistical homogenization under-corrects any station network-wide trend biases

Photo of a station of the US Climate Reference Network with a prominent wind shield for the rain gauges.
A station of the US Climate Reference Network.

In the last blog post I made the argument that the statistical detection of breaks in climate station data has problems when the noise is larger than the break signal. The post before argued that the best homogenization correction method we have can remove network-wide trend biases perfectly if all breaks are known. In the light of the last post, we naturally would like to know how well this correction method can remove such biases in the more realistic case when the breaks are imperfectly estimated. That should still be studied much better, but it is interesting to discuss a number of other studies on the removal of network-wide trend biases from the perspective of this new understanding.

So this post will argue that it theoretically makes sense that (unavoidable) inaccuracies of break detection lead to network-wide trend biases only being partially corrected by statistical homogenization.

1) We have seen this in our study of the correction method in response to small errors in the break positions (Lindau and Venema, 2018).

2) The benchmarking study of NOAA’s homogenization algorithm shows that if the breaks are big and easy they are largely removed, while in the scenario where breaks are plentiful and small half of the trend bias remains (Williams et al., 2012).

3) Another benchmarking study show that with the network density of Switzerland homogenization can find and remove clear trend biases, while if you thin this network to be similar to Peru the bias cannot be removed (Gubler et al., 2017).

4) Finally, a benchmarking study of relative humidity station observations in Austria could not remove much of the trend bias, which is likely because relative humidity is not correlated well from station to station (Chimani et al., 2018).

Statistical homogenization on a global scale makes warming estimates larger (Lawrimore et al., 2011; Menne et al., 2018). Thus if it can only remove part of any trend bias, this would mean that quite likely the actual warming was larger.

Figure 1: The inserted versus remaining network-mean trend error. Upper panel for perfect breaks. Lower panel for a small perturbation of the break position. The time series are 100 annual values and have 5 break. Figure 10 in Lindau and Venema (2018).

Joint correction method

First, what did our study on the correction method (Lindau and Venema, 2018) say about the importance of errors in the break position? As the paper was mostly about perfect breaks, we assumed that all breaks were known, but that they had a small error in their position. In the example to the right, we perturbed the break position by a normally distributed random number with standard deviation one (lower panel), while for comparison the breaks are perfect (upper panel).

In both cases we inserted a large network-wide trend bias of 0.873 °C over the length of the century long time series. The inserted errors for 1000 simulations is on the x-axis, the average inserted trend bias is denoted by x̅. The remaining error after homogenization is on the y-axis. Its average is denoted by y̅ and basically zero in case the breaks are perfect (top panel). In case of the small perturbation (lower panel) the average remaining error is 0.093 °C, this is 11 % of the inserted trend bias. That is the under-correction for is a quite small perturbation: 38 % of the positions is not changed at all.

If the standard deviation of the position perturbation is increased to 2, the remaining trend bias is 21 % of the inserted bias.

In the upper panel, there is basically no correlation between the inserted and the remaining error. That is, the remaining error does not depend on the break signal, but only on the noise. In the lower panel with the position errors, there is a correlation between the inserted and remaining trend error. So in this more realistic case, it does matter how large the trend bias due to the inhomogeneities is.

This is naturally an idealized case, position errors will be more complicated in reality and there would be spurious and missing breaks. But this idealized case fitted best to the aim of the paper of studying the correction algorithm in isolation.

It helps understand where the problem lies. The correction algorithm is basically a regression that aims to explain the inserted break signal (and the regional climate signal). Errors in the predictors will lead to an explained variance that is less than 100 %. One should thus expect that the estimated break signal is smaller than the actual break signal. It is thus expected that the trend change due to the estimated break signal produces is smaller than the actual trend change due to the inhomogeneities.

NOAA’s benchmark

That statistical homogenization under-corrects when the going gets tough is also found by the benchmarking study of NOAA’s Pairwise Homogenization Algorithm in Williams et al. (2012). They simulated temperature networks like the American USHCN network and added inhomogeneities according to a range of scenarios. (Also with various climate change signals.) Some scenarios were relatively easy, had few and large breaks, while others were hard and contained many small breaks. The easy cases were corrected nearly perfectly with respect to the network-wide trend, while in the hard cases only half of the inserted network-wide trend error was removed.

The results of this benchmarking for the three scenarios with a network-wide trend bias are shown below. The three panels are for the three scenarios. Each panel has results (the crosses, ignore the box plots) for three periods over which the trend error was computed. The main message is that the homogenized data (orange crosses) lies between the inhomogeneous data (red crosses) and the homogeneous data (green crosses). Put differently, green is how much the climate actually changed, red is how much the estimate is wrong due to inhomogeneities, orange shows that homogenization moves the estimate towards the truth, but never fully gets there.

If we use the number of breaks and their average size as a proxy for the difficulty of the scenario, the one on the left has 6.4 breaks with an average size of 0.8 °C, the one in the middle 8.4 breaks (size 0.4 °C) and the one on the right 10 breaks (size 0.4 °C). So this suggests there is a clear dose effect relationship; although there surely is more than just the number of breaks.

Figures from Williams et al. (2012) showing the results for three scenarios. This is a figure I created from parts of Figure 7 (left), Figure 5 (middle) and Figure 10 (right; their numbers).

When this study appeared in 2012, I found the scenario with the many small breaks much too pessimistic. However, our recent study estimating the properties of the inhomogeneities of the American network found a surprisingly large number of breaks: more than 17 per century; they were bigger: 0.5 °C. So purely based on the number of breaks the hardest scenario is even optimistic, but also size matters.

Not that I would already like to claim that even in a dense network like the American there is a large remaining trend bias and the actual warming was much larger. There is more to the difficulty of inhomogeneities than their number and size. It sure is worth studying.

Alpine benchmarks

The other two examples in the literature I know of are examples of under-correction in the sense of basically no correction because the problem is simply too hard. Gubler et al. (2017) shows that the raw data of the Swiss temperature network has a clear trend bias, which can be corrected with homogenization of its dense network (together with metadata), but when they thin the network to a network density similar to that of Peru, they are unable to correct this trend bias. For more details see my review of this article in the Grassroots Review Journal on Homogenization.

Finally, Chimani et al. (2018) study the homogenization of daily relative humidity observations in Austria. I made a beautiful daily benchmark dataset, it was a lot of fun: on a daily scale you have autocorrelations and a distribution with an upper and lower limit, which need to be respected by the homogeneous data and the inhomogeneous data. But already the normal homogenization of monthly averages was much too hard.

Austria has quite a dense network, but relative humidity is much influenced by very local circumstances and does not correlate well from station to station. My co-authors of the Austrian weather service wanted to write about the improvements: "an improvement of the data by homogenization was non‐ideal for all methods used". For me the interesting finding was: nearly no improvement was possible. That was unexpected. Had we expected that we could have generated a much simpler monthly or annual benchmark to show no real improvement was possible for humidity data and saved us a lot of (fun) work.

What does this mean for global warming estimates?

When statistical homogenization only partially removes large-scale trend biases what does this mean for global warming estimates? In the global temperature datasets statistical homogenization leads to larger warming estimates. So if we tend to underestimate how much correction is needed, this would mean that the Earth most likely warmed up more than current estimates indicate. How much exactly is hard to tell at the moment and thus needs a nuanced discussion. Let me give you my considerations in the next post.

Other posts in this series

Part 5: Statistical homogenization under-corrects any station network-wide trend biases

Part 4: Break detection is deceptive when the noise is larger than the break signal

Part 3: Correcting inhomogeneities when all breaks are perfectly known

Part 2: Trend errors in raw temperature station data due to inhomogeneities

Part 1: Estimating the statistical properties of inhomogeneities without homogenization


Chimani Barbara, Victor Venema, Annermarie Lexer, Konrad Andre, Ingeborg Auer and Johanna Nemec, 2018: Inter-comparison of methods to homogenize daily relative humidity. International Journal Climatology, 38, pp. 3106–3122.

Gubler, Stefanie, Stefan Hunziker, Michael Begert, Mischa Croci-Maspoli, Thomas Konzelmann, Stefan Brönnimann, Cornelia Schwierz, Clara Oria and Gabriela Rosas, 2017: The influence of station density on climate data homogenization. International Journal of Climatology, 37, pp. 4670–4683.

Lawrimore, Jay H., Matthew J. Menne, Byron E. Gleason, Claude N. Williams, David B. Wuertz, Russell S. Vose and Jared Rennie, 2011: An overview of the Global Historical Climatology Network monthly mean temperature data set, version 3. Journal Geophysical Research, 116, D19121.

Lindau, Ralf and Victor Venema, 2018: On the reduction of trend errors by the ANOVA joint correction scheme used in homogenization of climate station records. International Journal of Climatology, 38, pp. 5255– 5271. Manuscript:, paywalled article:

Menne, Matthew J., Claude N. Williams, Byron E. Gleason, Jared J. Rennie and Jay H. Lawrimore, 2018: The Global Historical Climatology Network Monthly Temperature Dataset, Version 4. Journal of Climate, 31, 9835–9854.

Williams, Claude, Matthew Menne and Peter Thorne, 2012: Benchmarking the performance of pairwise homogenization of surface temperatures in the United States. Journal Geophysical Research, 117, D05116.

Monday, 27 April 2020

Break detection is deceptive when the noise is larger than the break signal

I am disappointed in science. It is impossible that it took this long for us to discover that break detection has serious problems when the signal to noise ratio is low. However, as far as we can judge this was new science and it certainly was not common knowledge, which it should have been because it has large consequences.

This post describes a paper by Ralf Lindau and me about how break detection depends on the signal to noise ratio (Lindau and Venema, 2018). The signal in this case are the breaks we would like to detect. These breaks could be from a change in instrument or location of the station. We detect breaks by comparing a candidate station to a reference. This reference can be one other neighbouring station or an average of neighbouring stations. The candidate and reference should be sufficiently close so that they have the same regional climate signal, which is then removed by subtracting the reference from the candidate. The difference time series that is left contains breaks and noise because of measurement uncertainties and differences in local weather. The noise thus depends on the quality of the measurements, on the density of the measurement network and on how variable the weather is spatially.

The signal to noise ratio (SNR) is simply defined as the standard deviation of the time series containing only the breaks divided by the standard deviation of time series containing only the noise. For short I will denote these as the break signal and the noise signal, which have a break variance and a noise variance. When generating data to test homogenization algorithms, you know exactly how strong the break signal and the noise signal is. In case of real data, you can estimate it, for example with the methods I described in a previous blog post. In that study, we found a signal to noise ratio for annual temperature averages observed in Germany of 3 to 4 and in America of about 5.

Temperature is studied a lot and much of the work on homogenization takes place in Europe and America. Here this signal to noise ratio is high enough. That may be one reason why climatologists did not find this problem sooner. Many other sciences use similar methods, we are all supported by a considerable statistical literature. I have no idea what their excuses are.

Why a low SNR is a problem

As scientific papers go, the discussion is quite mathematical, but the basic problem is relatively easy to explain in words. In statistical homogenization we do not know in advance where the break or breaks will be. So we basically try many break positions and search for the break positions that result in the largest breaks (or, for the algorithm we studied, that explain the most variance).

If you do this for a time series that contains only noise, this will also produce (small) breaks. For example, in case you are looking for one break, due to pure chance there will be a difference between the averages of the first and the last segment. This difference is larger than it would be for a predetermined break position, as we try all possible break positions and then select the one with the largest difference. To determine whether the breaks we found are real, we require that they are so large that it is unlikely that they are due to chance, while there are actually no breaks in the series. So we study how large breaks are in series that only contains noise to determine how large such random breaks are. Statisticians would talk about the breaks being statistically significant with white noise as the null hypothesis.

When the breaks are really large compared to the noise one can see by eye where the positions of the breaks are and this method is nice to make this computation automatically for many stations. When the breaks are “just” large, it is a great method to objectively determine the number of breaks and the optimal break positions.

The problem comes when the noise is larger than the break signal. Not that it is fundamentally impossible to detect such breaks. If you have a 100-year time series with a break in the middle, you would be averaging over 50 noise values on either side and the difference in their averages would be much smaller than the noise itself. Even if noise and signal are about the same size the noise effect is thus expected to be smaller than the size of such a break. To put it in another way, the noise is not correlated in time, while the break signal is the same for many years; that fundamental difference is what the break detection exploits.

However, to come to the fundamental problem, it becomes hard to determine the positions of the breaks. Imagine the theoretical case where the break positions are fully determined by the noise, not by the breaks. From the perspective of the break signal, these break positions are random. The problem is, also random breaks explain a part of the break signal. So one would have a combination with a maximum contribution of the noise plus a part of the break signal. Because of this additional contribution by the break signal, this combination may have larger breaks than expected in a pure noise signal. In other words, the result can be statistically significant, while we have no idea where the positions of the breaks are.

In a real case the breaks look even more statistically significant because the positions of the breaks are determined by both the noise and the break signal.

That is the fundamental problem, the test for the homogeneity of the series rightly detects that the series contains inhomogeneities, but if the signal to noise ratio is low we should not jump to conclusions and expect that the set of break positions that gives us the largest breaks has much to do with the break positions in the data. Only if the signal to noise ratio is high, this relationship is close enough.

Some numbers

This is a general problem, which I expect all statistical homogenization algorithms to have, but to put some numbers on this, we need to specify an algorithm. We have chosen to study the multiple breakpoint method that is implemented in PRODIGE (Caussinus and Mestre, 2004), HOMER (Mestre et al., 2013) and ACMANT (Domonkos and Coll, 2017), these are among the best, if not the best, methods we currently have. We applied it by comparing pairs of stations, like PRODIGE and HOMER do.

For a certain number of breaks this method effectively computes the combination of breaks that has the highest break variance. If you add more breaks, you will increase the break variance those breaks explain, even if it were purely due to noise, so there is additionally a penalty function that depends on the number of breaks. The algorithm selects that option where the break variance minus such a penalty is highest. A statistician would call this a model selection problem and the job of the penalty is to keep the statistical model (the step function explaining the breaks) reasonably simple.

In the end, if the signal to noise ratio is one half, the breaks that explain the largest breaks are just as “good” at explaining the actual break signal in the data as breaks at random positions.

With this detection model, we derived the plot below, let me talk you through this. On the x-axis is the SNR, on the right the break signal is twice as strong as the noise signal. On the y-axis is how well the step function belonging to the detected breaks fits to the step function of the breaks we actually inserted. The lower curve, with the plus symbols, is the detection algorithm as I described above. You can see that for a high SNR it finds a solution that closely matches what we put in and the difference is almost zero. The upper curve, with the ellipse symbols, is for the solution you find if you put in random breaks. You can see that for a high SNR the random breaks have a difference of 0.5. As the variance of the break signal is one, this means that half the variance of the break signal is explained by random breaks.

Figure 13b from Lindau and Venema (2018).

When the SNR is about 0.5, the random breaks are about as good as the breaks proposed by the algorithm described above.

One may be tempted to think that if the data is too noisy, the detection algorithm should detect less breaks, that is, the penalty function should be bigger. However, the problem is not the detection of whether there are breaks in the data, but where the breaks are. A larger penalty thus does not solve the problem and even makes the results slightly worse. Not in the paper, but later I wondered whether setting more breaks is such a bad thing, so we also tried lowering the threshold, this again made the results worse.

So what?

The next question is naturally: is this bad? One reason to investigate correction methods in more detail, as described in my last blog post, was the hope that maybe accurate break positions are not that important. It could have been that the correction method still produces good results even with random break positions. This is unfortunately not the case, already quite small errors in break positions deteriorate the outcome considerably, this will be the topic of the next post.

Not homogenizing the data is also not a solution. As I described in a previous blog post, the breaks in Germany are small and infrequent, but they still have a considerable influence on the trends of stations. The figure below shows the trend differences between many pairs of nearby stations in Germany. Their differences in trends will be mostly due to inhomogeneities. The standard deviation of 0.628 °C per century for the pairs translated to an average error in the trends of individual stations of 0.4 °C per century.

The trend differences (y-axis) of pairs of stations (x-axis) in the German temperature network. The trends were computed from 316 nearby pairs over 1950 to 2000. Figure 2 from Lindau and Venema (2018).

This finding makes it more important to work on methods to estimate the signal to noise ratio of a dataset before we try to homogenize it. This is easier said than done. The method introduced in Lindau and Venema (2018) gives results for every pair of stations, but needs some human checks to ensure the fits are good. Furthermore, it assumes the break levels behave like noise, while in Venema and Lindau (2019) we found that the break signal in the USA behaves like a random walk. This 2019 method needs a lot of data, even the results for Germany are already quite noisy, if you apply it to data sparse regions you have to select entire continents. Doing so, however, biases the results to those subregions were the there are many stations and would thus give too high SNR estimates. So computing SNR worldwide is not just a blog post, but requires a careful study and likely the development of a new method to estimate the break and noise variance.

Both methods compute the SNR for one difference time series, but in a real case multiple difference time series are used. We will need to study how to do this in an elegant way. How many difference series are used depends on the homogenization method, this would also make the SNR method dependent. I would appreciate to also have an estimation method that is more universal and can be used to compare networks with each other.

This estimation method should then be applied to global datasets and for various periods to study which regions and periods have a problem. Temperature (as well as pressure) are variables that are well correlated from station to station. Much more problematic variables, which should thus be studied as well, are precipitation, wind, humidity. In case of precipitation, there tend to be more stations. This will compensate some, but for the other variables there may even be less stations.

We have some ideas how to overcome this problem, from ways to increase the SNR to completely different ways to estimate the influence of inhomogeneities on the data. But they are too preliminary to already blog about. Do subscribe to the blog with any of the options below the tag cloud near the end of the page. ;-)

When we digitize climate data that is currently only available on paper, we tend to prioritize data from regions and periods where we do not have much information yet. However, if after that digitization the SNR would still be low, it may be more worthwhile to digitize data from regions/periods where we already have more data and get that region/period to a SNR above one.

The next post will be about how this low SNR problem changes our estimates of how much the Earth has been warming. Spoiler: the climate “sceptics” will not like that post.

Other posts in this series

Part 5: Statistical homogenization under-corrects any station network-wide trend biases

Part 4: Break detection is deceptive when the noise is larger than the break signal

Part 3: Correcting inhomogeneities when all breaks are perfectly known

Part 2: Trend errors in raw temperature station data due to inhomogeneities

Part 1: Estimating the statistical properties of inhomogeneities without homogenization


Caussinus, Henri and Olivier Mestre, 2004: Detection and correction of artificial shifts in climate series. The Journal of the Royal Statistical Society, Series C (Applied Statistics), 53, pp. 405-425.

Domonkos, Peter and John Coll, 2017: Homogenisation of temperature and precipitation time series with ACMANT3: method description and efficiency tests. International Journal of Climatology, 37, pp. 1910-1921.

Lindau, Ralf and Victor Venema, 2018: The joint influence of break and noise variance on the break detection capability in time series homogenization. Advances in Statistical Climatology, Meteorology and Oceanography, 4, p. 1–18.

Lindau, R, Venema, V., 2019: A new method to study inhomogeneities in climate records: Brownian motion or random deviations? International Journal Climatology, 39: p. 4769– 4783. Manuscript: Article:

Mestre, Olivier, Peter Domonkos, Franck Picard, Ingeborg Auer, Stephane Robin, Émilie Lebarbier, Reinhard Boehm, Enric Aguilar, Jose Guijarro, Gregor Vertachnik, Matija Klancar, Brigitte Dubuisson, Petr Stepanek, 2013: HOMER: a homogenization software - methods and applications. IDOJARAS, Quarterly Journal of the Hungarian Meteorological Society, 117, no. 1, pp. 47–67.