Friday 20 November 2020

Yes, it makes sense not to have dinner parties while the schools are still open. Think of it as a Corona contact budget.


Can the kids go to school in restaurants

Jessica Winter, editor at The New Yorker


Analogies can be enlightening. Bad-faith actors will always find something to nitpick, but for those interested in understanding, analogies can help open a toolbox of existing ideas and argumentative structures.

I wondered whether it may be useful to talk about Corona contacts as a budget.

It would avoid arguments like "if churches can be open, why can't we have concerts under similar conditions". "If you cannot meet indoors with more than 15 people, then why are schools open? Math!" 

One would never argue "if we just bought this flat, why can't we buy a summer house?" Maybe you have the budget to buy a summer house, but having bought the flat does not mean you can also afford the summer house.

Similarly in the political realm: "if we can have social security, why can't we have a basic income (social security for all)?" For me a basic income means freedom, fulfilment of human potential and prosperity, but you still have to find the money. "If an average OECD country can spend 10% of its GDP on healthcare, why can't it spend 20%?" It can, and America does, but a country with universal health care that wanted to dismantle its system and adopt the American partial system would still find it hard to fund the additional 10%.

When it comes to budgets it is immediately clear that you have to set priorities and invest wisely.

The reproduction number of the SARS-CoV-2 virus is between two and three. Let's assume for this article that it is two to get easier numbers. This means that one infected person on average infects two other people. If we reduce the number of infectious contacts by more than half, the virus would decline.

The "on average" does a lot of work. How many people one person infects varies widely. As a rule of thumb for SARS-CoV-2: four out of five infected people infect only one other person or nobody, while one in five infects many people. The average is still two.
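For readers who like to check such numbers, here is a toy simulation. The split into 80% low spreaders and 20% superspreaders, and the exact counts drawn, are purely illustrative assumptions, chosen so that the average comes out at two:

```python
import random

def secondary_infections(rng):
    """Draw the number of people one case infects.

    A toy overdispersed distribution: four out of five cases infect
    nobody or one person, one in five is a superspreading event.
    The numbers are chosen so the mean works out to exactly 2.
    """
    if rng.random() < 0.8:          # 4 of 5 cases: 0 or 1 secondary case
        return rng.choice([0, 1])   # mean contribution: 0.8 * 0.5 = 0.4
    return rng.choice([6, 10])      # 1 of 5 cases: a cluster; 0.2 * 8 = 1.6

rng = random.Random(42)
draws = [secondary_infections(rng) for _ in range(100_000)]
print(sum(draws) / len(draws))      # close to the average R of 2
```

Most draws are 0 or 1; the average of two is carried almost entirely by the occasional cluster.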

And you have to average over a population that is in contact with each other. If in France no one had any contacts while in Germany life continued as normal, the virus would still spread like wildfire in Germany. But if inside the city of Bonn half of the people disappear, the remaining people have fewer contacts than before. (The remaining half should not especially seek each other out for the analogy to hold.)

How does this analogy help? If we look at the budget of a country like Germany, it makes clear that we should look for reductions where we spend a lot: work, school, free time. I am as annoyed by the anti-Corona protests as many who complain about them, but compared to 80 million inhabitants who see each other at work and school (indoors) every day, these protests, even if they were really big, are a completely insignificant number of contacts. And the right to protest is a foundation of our societies and should thus have a high priority. I think it is fine to mandate masks at protests, and if you do so you should uphold the rule of law.

Less than 20% of Germany's population is younger than 20. So we could afford to spend our contacts there and ask the other 80% to do more. People often argue that children not going to school is disruptive for the economy. I would counter that a pandemic lasting one year is a large part of their lives, and that young people mostly do this to protect others. There is naturally no need to squander our budget: we could require older kids to wear masks to reduce the effective number of contacts, install air filters or far-UVC lights in classrooms, or reduce the number of days children go to school.

Some feel we should close the schools to protect teachers, but the main reason to care about avoiding contacts is, even now, not the people being infected today, but the spreading of the virus and all the people who will die because of that.

If we live above our contact budget, most of the dying happens after several links in the chain of infection and no longer close to the school: the teacher or student infects 2 others, they infect 4 others, then 8, 16, 32, 64, 128, ... Those 128 will reside all over the city or county, if not the state, and have many different professions. If we lived within our Corona budget and the level of infection were and stayed low, the entire community, including teachers, would be safe.

The exponential growth of a virus also nicely fits the exponential growth of money in your [[savings account]]. I added a link for young people. A savings account used to be a place where you would keep your money and the bank would give you a percentage of the amount as a thank you, which they called "interest". People who are into money and budgets likely still remember this and how it was normal to "invest" money to have more money later.

When the press talks about exponential growth, I tend to worry they simply mean fast growth. Economic growth is much slower than the pandemic, but when it comes to money people get glowing eyes and talk enthusiastically about compound interest and putting something aside for later.

Similarly, when a society invests in fewer contacts, we can have more freedom later. Even more so because once the number of infections is low enough, track and trace becomes much more efficient and you get double returns on investment. Like an investment banker who has to pay less taxes because ... reasons.

At least the financial press should know the famous example of exponential growth: the craftsman who "only" asks the king for rice as payment for his chessboard: one grain on the first square, two on the second, four on the third square and so on. 
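The arithmetic of the chessboard is a two-line check:

```python
# One grain on the first square, doubling on each of the 64 squares.
total = sum(2 ** square for square in range(64))
print(total)                      # 18446744073709551615 grains
print(total == 2 ** 64 - 1)      # True: the closed form of the doubling sum
```

Roughly 18 quintillion grains, far more rice than any kingdom has ever owned.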


What is true for infections is also true for hospital beds and ICU beds. Once half of your patients are COVID-19 patients, it is only a matter of one more doubling time before the capacity is filled. Exponential growth is not just fast, it overwhelms linear systems like hospitals, where you cannot keep on doubling the number of beds.
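A small back-of-the-envelope calculation makes this concrete. The 1000-bed capacity and the 10-day doubling time below are made-up illustrative numbers, not a forecast:

```python
def doublings_until_full(covid_beds, capacity, doubling_time_days=10):
    """Days until COVID-19 patients alone fill the capacity, assuming
    the patient count keeps doubling every `doubling_time_days` days.
    Illustrative toy model, not an epidemiological forecast."""
    days = 0
    while covid_beds < capacity:
        covid_beds *= 2
        days += doubling_time_days
    return days

# Half of a 1000-bed capacity already COVID: one doubling fills it.
print(doublings_until_full(500, 1000))   # 10 days
# Even starting from just 1% of capacity buys only 7 doublings.
print(doublings_until_full(10, 1000))    # 70 days
```

The point of the toy model: whether you start at 50% or at 1% of capacity makes a difference of only a couple of months when the patient count keeps doubling.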
If we let it get this far, we are forcing doctors to choose who lives. Who has been in the ICU too long and would likely stay there a long time, while this capacity could be used for multiple new patients? Who is removed from the ICU to die? A healthy society does not put doctors in such a position.

With good care around 1 percent of infected people die in the West (in the young societies of Africa fewer). Supporters of the virus tend to use this number or even much lower fantasy numbers. However, if we let it get out of control like this, ignoring the exponential growth and the delay between infections and deaths, hospital care would collapse and a few percent would die.

Many more people need to go to the hospital; in Germany this is 17%. A recent French study reported that after 110 days most patients were still tired and had trouble breathing, and many had not yet returned to work.
At the latest when the hospitals collapse, people will reduce contacts, even if not mandated. It is much smarter to make an investment earlier, to reduce our number of infectious contacts earlier.
A well-known American president said it is smart to go bankrupt. It is smarter to make money.
Investing early pays off even more because then more subtle measures are still possible, while in an emergency a much more invasive lockdown will be necessary and, for those who only care about money, more damage will be done to the economy.

(As many of my readers are interested in climate change, let me add that I find it weird that when it comes to protecting the climate people often talk about it as a cost and not as an investment that will pay good dividends in the future, just like any other investment. If it bothers you that our kids would then have it better than we do, you can finance the investments with loans, like any business would.)


Monday 9 November 2020

Science Feedback on Steroids

Climate Feedback is a group of climate scientists reviewing press articles on climate change. By networking this valuable work with science-interested citizens we could put this initiative on steroids.

Disclosure: I am a member of Climate Feedback.

How Climate Feedback works

Climate Feedback works as follows. A science journalist monitors which stories on climate change are widely shared on social media and invites publishing climate scientists with relevant expertise to review the factual claims being made. The scientists write detailed reviews of concrete claims, ideally using web annotations (see example below), sometimes by email.



They also write a short summary of the article and grade its scientific credibility. These comments, summaries and grades are then summarized in a graphic and an article written by the science journalist. 

Climate Feedback takes care of spreading the reviews to the public and to the publication that was reviewed. Climate Feedback is also part of a network of fact-checking organizations, which gives it more credibility, and it adds metadata to the review pages that social media and search engines can show their users.



For scientists this is a very efficient fact-checking operation. The participants only have to respond to the claims they have expertise on. If there are many claims outside my expertise, I can wait until my colleagues have added their web annotations before I write my summary and determine my grade. Especially compared to writing a blog post, Climate Feedback is very effective.

The initiative recently branched out to reviewing health claims with a new Health Feedback group. The umbrella is now called Science Feedback.

The impact

But there is only so much a group of scientists can do, and by the time the reviews are in and summarized, the article is mostly old news. Only a small fraction of readers would see any notifications social media systems could put on posts spreading it.

This is still important information for people who closely follow the topic: it helps them see how such reviews are done, assess which publications are reliable and see which groups are credible.

The reviews may be most important for the journalists and the publications involved. Journalists doing high-quality work can now demonstrate this to editors, who will mostly not be able to assess this themselves. Some journalists have even asked for reviews of important pieces to showcase the quality of their work. Conversely, editors can seek out good journalists and cut ties with journalists who regularly hurt their reputation. The latter naturally only helps publications that care about quality.

The Steroids

With a larger group we could review more articles and have results while people are still reading them. There are not enough (climate) scientists to do this.

For Climate Feedback I only review articles on topics where I have expertise. But I think I would still do a decent job outside of my expertise. It is hard to determine how good a good article is, but the ones that are clearly bad are easy to identify and this does not require much expertise. At least in the climate branch of the US culture war the same tropes are used over and over again, the same "thinking" errors are made over and over again. 

Many who are interested in climate change and its scientific details, but are not scientists, would probably do a good job identifying these bad articles. Maybe even better. They say that magicians were better at debunking paranormal claims than scientists were. We live in a bubble where most argue in good faith, and science-interested normal citizens may well have a better BS detector.

However, how do we know who is good at this? Clearly not everyone, otherwise such a service would not be needed. We would have the data from Climate Feedback and Health Feedback to determine which citizen scientists' assessments predict the assessments of the scientists well. We could also ask people to classify the topic of the article. I would be best at observational climatology, decent in physical climatology and likely only average when it comes to many climate change impacts and economic questions. We could also ask people how confident they are in their assessments.

In the end it would be great to be able to submit ratings in a user-friendly way with 1) a browser add-on on the article page itself, and 2) replies to posts mentioning the article on social media, just as adding the handle of the PubPeerBot to a tweet automatically submits that tweet to PubPeer.

A server would compute the ratings and, as soon as there is enough data, create a review page with the ratings as metadata to be used by search engines and social media sites. We will have to see whether they are willing to use such a statistical product. An application programming interface (API) and ActivityPub could also be used to spread the information to interested parties.

I would be happy to use this information on the micro-blogging system for scientists Frank Sonntag and I have set up. I presume more Open Social Media communities would be grateful for the information to make their place more reality-friendly. A browser add-on could also display the feedback on the article's homepage itself and on posts linking to it.

How to start?

Before creating such a huge system, I would propose a much smaller feasibility study. Here people would be informed about articles Climate or Health Feedback are working on, and they could return their assessments until the one by Climate Feedback is published. This could be a simple email distribution list to distribute the articles and a cloud-based spreadsheet or web form to return the results.

This system should be enough to study whether citizens can distinguish fact from fiction well enough (I expect so, but knowing for sure is valuable) and to develop statistical methods to estimate how well people are doing, how to compute an overall score and how many reviews are needed to do so.
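As a sketch of the simplest such statistical method: score each volunteer by the mean absolute difference between their grades and the official ones over the articles both reviewed. The article names and volunteer grades below are made up; the -2 to +2 credibility scale follows Climate Feedback's grading, with lower scores meaning better prediction:

```python
def prediction_score(volunteer, official):
    """Mean absolute difference between a volunteer's credibility grades
    and the official grades, over the articles both have reviewed.
    Grades on a -2 (very low credibility) .. +2 (very high) scale;
    a lower score means the volunteer predicts the official grade better."""
    shared = set(volunteer) & set(official)
    if not shared:
        return None  # no overlap: cannot assess this volunteer yet
    return sum(abs(volunteer[a] - official[a]) for a in shared) / len(shared)

official    = {"article_x": -1.5, "article_y": 0.5, "article_z": 1.0}
volunteer_a = {"article_x": -2.0, "article_y": 0.0, "article_z": 1.5}
print(prediction_score(volunteer_a, official))   # 0.5
```

A real scoring method would also need to weigh how many articles the overlap contains and how confident the volunteer said they were, but this is the core comparison.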

This set-up points to two complications the full system would have. Firstly, only citizens' assessments that are made before the official feedback can be used. This should not be too much of a problem, as most readers will read the article before the official feedback is published.

Secondly, as the number of official feedbacks will be small, many volunteers will likely review none or just a few of these articles themselves. Thus how accurate the predictions of person A for articles X, Y and Z are may have to be assessed by comparing their assessments with those of B, C and D, who reviewed X, Y or Z as well as one of the articles Climate Feedback reviewed. This makes the computation more complicated and uncertain, but if B, C and D are good enough, this should be doable. Alternatively, we would have to keep on informing our volunteers of the articles being reviewed by the scientists themselves.

This new system could be part of Science Feedback or an independent initiative. I feel it would at least be good to have a separate homepage, as the two systems are quite different and the public should not mix them up. A reason to keep it separate is that this system could also be used in combination with other fact checkers, but we could also make that organizational change when it comes to that.

Another organizational question is whether we would like Google and Facebook to have access to this information or prefer a license that excludes them. Short term, it is naturally best when they also use it to inform as many people as possible. Long term, it would also be valuable to break the monopolies of Google and Facebook. Having alternative services that can deliver better quality due to our assessments could contribute to that. They have money, we have people.

I asked on Twitter and Mastodon whether people would be interested in contributing to such a system. Fitting to my prejudice people on Twitter were more willing to review (I do more science on Twitter) and people on Mastodon were more willing to build software (Mastodon started with many coders).

What do you think? Could such a system work? Would enough people be willing to contribute? Is it technologically and statistically feasible? Any ideas to make the system or the feasibility study better?

Related reading

Climate Feedback explainer from 2016: Climate scientists are now grading climate journalism
Discussion of a controversial Climate Feedback review and the grading system used: Is nitpicking a climate doomsday warning allowed?

Monday 12 October 2020

The deleted chapter of the WMO Guidance on the homogenisation of climate station data

The Task Team on Homogenization (TT-HOM) of the Open Panel of CCl Experts on Climate Monitoring and Assessment (OPACE-2) of the Commission on Climatology (CCl) of the World Meteorological Organization (WMO) has published their Guidance on the homogenisation of climate station data.

The guidance report was a bit longish, so at the end we decided that the last chapter on "Future research & collaboration needs" was best deleted. As chair of the task team and as someone who likes to dream about what others could do from a comfy chair, I wrote most of this chapter, and thus we decided to simply turn it into a post for this blog. Enjoy.


This guidance is based on our current best understanding of inhomogeneities and homogenisation. However, writing it also makes clear there is a need for a better understanding of the problems.

A better mathematical understanding of statistical homogenisation is important because that is what most of our work is based on. A stronger mathematical basis is a prerequisite for future methodological improvements.

A stronger focus on a (physical) understanding of inhomogeneities would complement and strengthen the statistical work. This kind of work is often performed at the station or network level, but is also needed at larger spatial scales. Much of this work is performed using parallel measurements, but these are typically not internationally shared.

In an observational science the strength of the outcomes depends on a consilience of evidence. Thus having evidence on inhomogeneities from both statistical homogenisation and physical studies strengthens the science.

This chapter will discuss the needs for future research on homogenisation, grouped into five kinds of problems. In the first section we will discuss research on improving our physical understanding and physics-based corrections. The next section is about break detection, especially about two fundamental problems in statistical homogenisation: the inhomogeneous-reference problem and the multiple-breakpoint problem.

The section after that is about computing the uncertainties in trends and long-term variability estimates from homogenised data due to remaining inhomogeneities. It may be possible to improve correction methods by treating correction as a statistical model selection problem. The last section discusses whether inhomogeneities are stochastic or deterministic and how that may affect homogenisation, especially correction methods for the variability around the long-term mean.

For all the research ideas mentioned below, it is understood that in future we should study more meteorological variables than temperature. In addition, more studies on inhomogeneities across variables could be helpful to understand the causes of inhomogeneities and increase the signal to noise ratio. Homogenisation by national offices has advantages because there all climate elements from one station are stored together. This helps in understanding and identifying breaks. It would help homogenisation science and climate analysis to have a global database for all climate elements, like ICOADS for marine data. A Copernicus project has started working on this for land station data, which is an encouraging development.

Physical understanding

It is a good scientific practice to perform parallel measurements in order to manage unavoidable changes and to compare the results of statistical homogenisation to the expectations given the cause of the inhomogeneity according to the metadata. This information should also be analysed on continental and global scales to get a better understanding of when historical transitions took place and to guide homogenisation of large-scale (global) datasets. This requires more international sharing of parallel data and standards on the reporting of the size of breaks confirmed by metadata.

The Dutch weather service KNMI published a protocol for managing possible future changes of the network: who decides what needs to be done in which situation, what kind of studies should be made, where the studies should be published, and that the parallel data should be stored in their central database as experimental data. A translation of this report will soon be published by the WMO (Brandsma et al., 2019) and will hopefully inspire other weather services to formalise their network change management.

Next to statistical homogenisation, making and studying parallel measurements, and other physical estimates, can provide a second line of evidence on the magnitude of inhomogeneities. Having multiple lines of evidence provides robustness to observational sciences. Parallel data is especially important for the large historical transitions that are most likely to produce biases in network-wide to global climate datasets. It can validate the results of statistical homogenisation and be used to estimate possibly needed additional adjustments. The Parallel Observations Science Team of the International Surface Temperature Initiative (ISTI-POST) is working on building such a global dataset with parallel measurements.

Parallel data is especially suited to improving our physical understanding of the causes of inhomogeneities by studying how the magnitude of the inhomogeneity depends on the weather and on instrumental design characteristics. This understanding is important for more accurate corrections of the distribution, for realistic benchmarking datasets to test our homogenisation methods, and to determine which additional parallel experiments are especially useful.

Detailed physical models of the measurement, for example, the flow through the screens, radiative transfer and heat flows, can also help gain a better understanding of the measurement and its error sources. This aids in understanding historical instruments and in designing better future instruments. Physical models will also be paramount for understanding the impact of the surroundings on the measurement — nearby obstacles and surfaces influencing error sources and air flow — and of changes in the measurand, such as urbanisation/deforestation or the introduction of irrigation. Land-use changes, especially urbanisation, should be studied together with the relocations they may provoke.

Break detection

Longer climate series typically contain more than one break. This so-called multiple-breakpoint problem is currently an important research topic. A complication of relative homogenisation is that the reference stations can also have inhomogeneities. This so-called inhomogeneous-reference problem is not optimally solved yet. It is also not clear what temporal resolution is best for detection, nor what the optimal way is to handle the seasonal cycle in the statistical properties of climate data and of many inhomogeneities.

For temperature time series about one break per 15 to 20 years is typical and multiple breaks are thus common. Unfortunately, most statistical detection methods have been developed for one break and for the null hypothesis of white (sometimes red) noise. In case of multiple breaks the statistical test should not only take the noise variance into account, but also the break variance from breaks at other positions. For low signal to noise ratios, the additional break variance can lead to spurious detections and inaccuracies in the break position (Lindau and Venema, 2018a).

To apply single-breakpoint tests to series with multiple breaks, one ad-hoc solution is to first split the series at the most significant break (as, for example, the standard normalised homogeneity test, SNHT, does) and then investigate the subseries. Such a greedy algorithm does not always find the optimal solution. Another solution is to detect breaks on short windows. The window should be short enough to contain only one break, which reduces the power of detection considerably. This method is not used much nowadays.
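A minimal sketch of this hierarchic splitting, with a simplified SNHT-like statistic. The function names, the fixed threshold and the statistic itself are simplifications for illustration; real implementations use critical values that depend on the series length:

```python
def snht_statistic(x):
    """SNHT-like statistic: for each split point k return
    T(k) = k*z1^2 + (n-k)*z2^2, with z1, z2 the standardised means
    of the two segments. Returns (max T, argmax split index)."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    if var == 0:
        return 0.0, None          # constant segment: no break to find
    sd = var ** 0.5
    best_t, best_k = 0.0, None
    for k in range(1, n):
        z1 = (sum(x[:k]) / k - mean) / sd
        z2 = (sum(x[k:]) / (n - k) - mean) / sd
        t = k * z1 ** 2 + (n - k) * z2 ** 2
        if t > best_t:
            best_t, best_k = t, k
    return best_t, best_k

def greedy_splits(x, threshold, offset=0):
    """Recursively split at the most significant break (hierarchic
    splitting). A greedy search: it is not guaranteed to find the
    optimal combination of breaks."""
    t, k = snht_statistic(x)
    if k is None or t < threshold:
        return []
    return (greedy_splits(x[:k], threshold, offset)
            + [offset + k]
            + greedy_splits(x[k:], threshold, offset + k))

series = [0.0] * 20 + [1.0] * 20 + [3.0] * 20   # two step changes
print(greedy_splits(series, threshold=15))       # [20, 40]
```

On this noise-free toy series the greedy search happens to find both breaks; with noise and breaks of opposite sign partially masking each other, it can miss or misplace them, which is the weakness the multiple-breakpoint methods below address.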

Multiple-breakpoint methods can find an optimal solution and are nowadays numerically feasible. This can be done in a hypothesis-testing framework (MASH) or in a statistical model selection framework. For a given number of breaks, these methods find the break combination that minimizes the internal variance, that is, the variance of the homogeneous subperiods (equivalently, the break combination that maximizes the variance of the breaks). To find the optimal number of breaks, a penalty is added that increases with the number of breaks. Examples of such methods are PRODIGE (Caussinus & Mestre, 2004) and ACMANT (based on PRODIGE; Domonkos, 2011b). In a similar line of research, Lu et al. (2010) solved the multiple-breakpoint problem using a minimum description length (MDL) based information criterion as penalty function.
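The optimisation these methods perform can be sketched as a generic dynamic program: minimise the sum of squared deviations within the homogeneous subperiods plus a penalty for each break. A fixed penalty per break is the simplest possible choice, used here only for illustration; this is a generic sketch in the spirit of such methods, not the actual PRODIGE or ACMANT algorithm:

```python
def optimal_breaks(x, penalty):
    """Optimal multiple-breakpoint segmentation by dynamic programming:
    minimise the internal (within-segment) sum of squares plus a fixed
    penalty per break. O(n^2) in the series length."""
    n = len(x)
    prefix, prefix2 = [0.0], [0.0]           # running sums of x and x**2
    for v in x:
        prefix.append(prefix[-1] + v)
        prefix2.append(prefix2[-1] + v * v)

    def seg_cost(i, j):                      # sum of squared deviations
        s = prefix[j] - prefix[i]            # of x[i:j] from its own mean
        return prefix2[j] - prefix2[i] - s * s / (j - i)

    best = [0.0] * (n + 1)                   # best[j]: minimal cost for x[:j]
    back = [0] * (n + 1)                     # back[j]: last break before j
    for j in range(1, n + 1):
        best[j], back[j] = seg_cost(0, j), 0
        for i in range(1, j):
            c = best[i] + penalty + seg_cost(i, j)
            if c < best[j]:
                best[j], back[j] = c, i
    breaks, j = [], n                        # recover breaks by backtracking
    while back[j] > 0:
        breaks.append(back[j])
        j = back[j]
    return sorted(breaks)

series = [0.0] * 20 + [1.0] * 20 + [3.0] * 20
print(optimal_breaks(series, penalty=5.0))   # [20, 40]
```

Unlike the greedy search, the dynamic program considers every break combination implicitly and is guaranteed to return the combination with the lowest penalised cost.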

The penalty function of PRODIGE was found to be suboptimal (Lindau and Venema, 2013). The penalty should be a function of the number of breaks, not fixed per break, and the relation with the length of the series should be reversed. It is not clear yet how sensitively homogenisation methods respond to this, but increasing the penalty per break in the case of a low signal to noise ratio to reduce the number of breaks does not make the estimated break signal more accurate (Lindau and Venema, 2018a).

Not only the candidate station but also the reference stations will have inhomogeneities, which complicates homogenisation. Such inhomogeneities can be climatologically especially important when they are due to network-wide technological transitions. An example of such a transition is the current replacement of temperature observations using Stevenson screens by automatic weather stations. Such transitions are important periods as they may cause biases in the network and global average trends and they produce many breaks over a short period.

A related problem is that sometimes all stations in a network have a break at the same date, for example, when a weather service changes the time of observation. Nationally such breaks are corrected using metadata. If this change is unknown in global datasets one can still detect and correct such inhomogeneities statistically by comparison with other nearby networks. That would require an algorithm that additionally knows which stations belong to which network and prioritizes correcting breaks found between stations in different networks. Such algorithms do not exist yet and information on which station belongs to which network for which period is typically not internationally shared.

The influence of inhomogeneities in the reference can be reduced by computing composite references over many stations, removing reference stations with breaks and by performing homogenisation iteratively.
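A composite reference is simple to write down: the candidate series minus the mean of its neighbours at each time step. Averaging over many neighbours dilutes the influence of a break in any single reference station. A toy example with made-up numbers:

```python
def difference_series(candidate, neighbours):
    """Candidate minus a composite reference (the mean of the neighbour
    series at each time step). Breaks in the candidate stand out in the
    difference series because the shared regional climate signal cancels."""
    composite = [sum(vals) / len(vals) for vals in zip(*neighbours)]
    return [c - r for c, r in zip(candidate, composite)]

candidate  = [10.0, 10.1, 10.6, 10.6]        # break of +0.5 after year 2
neighbour1 = [10.1, 10.2, 10.1, 10.0]
neighbour2 = [ 9.9, 10.0, 10.1, 10.2]
print(difference_series(candidate, [neighbour1, neighbour2]))
```

The first two values of the difference series are near zero, the last two near +0.5: the break in the candidate is visible even though each individual neighbour varies.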

A direct approach to solving this problem would be to simultaneously homogenise multiple stations, also called joint detection. A step in this direction are pairwise homogenisation methods where breaks are detected in the pairs. This requires an additional attribution step, which attributes the breaks to a specific station. Currently this is done by hand (for PRODIGE; Caussinus and Mestre, 2004; Rustemeier et al., 2017) or with ad-hoc rules (by the Pairwise homogenisation algorithm of NOAA; Menne and Williams, 2009).

In the homogenisation method HOMER (Mestre et al., 2013) a first attempt is made to homogenise all pairs simultaneously using a joint detection method from bio-statistics. Feedback from first users suggests that this method should not be used automatically. It should be studied how well this method works and where the problems come from.

Multiple-breakpoint methods are expected to be more accurate than single-breakpoint methods. This expected higher accuracy is founded on theory (Hawkins, 1972). In addition, in the HOME benchmarking study it was numerically found that modern homogenisation methods, which take the multiple-breakpoint and the inhomogeneous-reference problems into account, are about a factor of two more accurate than traditional methods (Venema et al., 2012).

However, the current version of CLIMATOL applies single-breakpoint detection tests — first SNHT detection on a window, then splitting — and achieves results comparable to modern multiple-breakpoint methods with respect to break detection and homogeneity of the data (Killick, 2016). This suggests that the multiple-breakpoint detection principle may not be as important as previously thought and warrants deeper study, or that the accuracy of CLIMATOL is partly due to an unknown unknown.

The signal to noise ratio is paramount for the reliable detection of breaks. It would thus be valuable to develop statistical methods that explain part of the variance of a difference time series and remove this to see breaks more clearly. Data from (regional) reanalysis could be useful predictors for this.

First methods have been published to detect breaks in daily data (Toreti et al., 2012; Rienzner and Gandolfi, 2013). It has not been studied yet what the optimal resolution for break detection is (daily, monthly, annual), nor what the optimal way is to handle the seasonal cycle in the climate data and to exploit the seasonal cycle of inhomogeneities. In the daily temperature benchmarking study of Killick (2016) most non-specialised detection methods performed better than the daily detection method MAC-D (Rienzner and Gandolfi, 2013).

The selection of appropriate reference stations is a necessary step for accurate detection and correction. Many different methods and metrics are used for the station selection, but studies on the optimal method are missing. The knowledge of local climatologists which stations have a similar regional climate needs to be made objective so that it can be applied automatically (at larger scales).

For detection a high signal to noise ratio is most important, while for correction it is paramount that all stations are in the same climatic region. Typically the same networks are used for both detection and correction, but it should be investigated whether a smaller network for correction would be beneficial. Also in general, we need more research on understanding the performance of (monthly and daily) correction methods.

Computing uncertainties

Also after homogenisation, uncertainties remain in the data due to various problems:

  • Not all breaks in the candidate station have been or can be detected.

  • False alarms are an unavoidable trade-off for detecting many real breaks.

  • Uncertainty in the estimation of correction parameters due to limited data.

  • Uncertainties in the corrections due to limited information on the break positions.

From validation and benchmarking studies we have a reasonable idea about the remaining uncertainties that one can expect in the homogenised data, at least with respect to changes in the long-term mean temperature. For many other variables and changes in the distribution of (sub-)daily temperature data individual developers have validated their methods, but systematic validation and comparison studies are still missing.

Furthermore, such studies only provide a general uncertainty level, whereas more detailed information for every single station/region and period would be valuable. The uncertainties will strongly depend on the signal to noise ratios, on the statistical properties of the inhomogeneities of the raw data and on the quality and cross-correlations of the reference stations. All of which vary strongly per station, region and period.

Communicating such a complicated error structure, which is mainly temporal but also partially spatial, is a problem in itself. Furthermore, not only the uncertainty in the means should be considered; especially for daily data, uncertainties in the complete probability density function need to be estimated and communicated. This could be done with an ensemble of possible realisations, similar to Brohan et al. (2006).

An analytic understanding of the uncertainties is important, but is often limited to idealised cases. Numerical validation studies, such as the past HOME and upcoming ISTI studies, are therefore also important for assessing homogenisation algorithms under realistic conditions.

Creating validation datasets also helps to see the limits of our understanding of the statistical properties of the break signal. This is especially the case for variables other than temperature and for daily and (sub-)daily data. Information is needed on the real break frequencies and size distributions, but also on their auto-correlations and cross-correlations, as well as, as explained in the next section, on the stochastic nature of breaks in the variability around the mean.
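As an illustration of what building such a dataset involves, the sketch below inserts a random step-function break signal into a synthetic station series. The break frequency (one per 20 years on average) and the size distribution are illustrative assumptions, not estimates of the real values, which is exactly the information we lack.

```python
import numpy as np

def add_break_signal(n_years, freq=1/20, size_std=0.8,
                     rng=np.random.default_rng(0)):
    """Return a random step-function break signal for one station.

    freq: expected breaks per year; size_std: standard deviation of the
    break sizes in deg C. Both are illustrative assumptions.
    """
    signal = np.zeros(n_years)
    breaks = rng.random(n_years) < freq   # Poisson-like break process
    sizes = rng.normal(0, size_std, n_years)
    for pos in np.nonzero(breaks)[0]:
        signal[pos:] += sizes[pos]        # each break shifts the rest
    return signal

rng = np.random.default_rng(1)
homogeneous = rng.normal(0, 0.5, 100)     # idealised station noise
inhomogeneous = homogeneous + add_break_signal(100)
```

Auto-correlated break processes and seasonally varying break sizes would be the next steps towards realism.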

Validation studies focussed on difficult cases would be valuable for a better understanding. For example, sparse networks, isolated island networks, large spatial trend gradients and strong decadal variability in the difference series of nearby stations (for example, due to El Niño in complex mountainous regions).

The advantage of simulated data is that one can create a large number of quite realistic complete networks. For daily data it will remain hard in the years to come to generate a realistic validation dataset. Thus even though using parallel measurements is mostly limited to one break per test, they do provide the highest degree of realism for this one break.

Deterministic or stochastic corrections?

Annual and monthly data is normally used to study trends and variability in the mean state of the atmosphere. Consequently, typically only the mean is adjusted by homogenisation. Daily data, on the other hand, is used to study climatic changes in weather variability, severe weather and extremes. Consequently, not only the mean should be corrected, but the full probability distribution describing the variability of the weather.

The physics of the problem suggests that many inhomogeneities are caused by stochastic processes. An example affecting many instruments is differences in the response time of instruments, which can lead to differences determined by turbulence. A fast thermometer will on average read higher maximum temperatures than a slow one, but this difference will be variable and sometimes much higher than the average. In the case of errors due to insolation, the radiation error will be modulated by clouds. An insufficiently shielded thermometer will need larger corrections on warm days, which will typically be more sunny, but some warm days will be cloudy and not need much correction, while other warm days are sunny and calm with a dry hot surface. The adjustment of daily data for studies on changes in the variability is thus a distribution problem and not only a regression bias-correction problem. For data assimilation (numerical weather prediction), accurate bias correction (with regression methods) is probably the main concern.

Seen as a variability problem, the correction of daily data is similar to statistical downscaling in many ways. Both methodologies aim to produce bias-corrected data with the right variability, taking into account the local climate and large-scale circulation. One lesson from statistical downscaling is that increasing the variance of a time series deterministically by multiplication with a fraction, called inflation, is the wrong approach and that the variance that could not be explained by regression using predictors should be added stochastically as noise instead (Von Storch, 1999). Maraun (2013) demonstrated that the inflation problem also exists for the deterministic Quantile Matching method, which is also used in daily homogenisation. Current statistical correction methods deterministically change the daily temperature distribution and do not stochastically add noise.
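The inflation problem can be illustrated numerically. In this toy example (my own, not from the cited papers), half the variance of the predictand is local noise; inflating the regression estimate reproduces the variance but exaggerates the correlation with the predictor, whereas adding the unexplained variance as noise reproduces both:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

x = rng.normal(size=n)      # predictor, e.g. a reference series
y = x + rng.normal(size=n)  # predictand: half its variance is local noise

yhat = x  # the regression estimate (slope one in this construction)

# Inflation: rescale the estimate so its variance matches the observations.
inflated = yhat * np.sqrt(y.var() / yhat.var())
# Randomisation: add the unexplained variance back as independent noise.
randomized = yhat + rng.normal(scale=np.sqrt(y.var() - yhat.var()), size=n)

for name, series in [("truth", y), ("inflated", inflated),
                     ("randomized", randomized)]:
    print(name, round(series.var(), 2), round(np.corrcoef(x, series)[0, 1], 2))
# All three have variance ~2, but the inflated series is perfectly
# correlated with the predictor (1.0) instead of ~0.71: its "local
# variability" wrongly follows the large-scale signal.
```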

Transferring ideas from downscaling to daily homogenisation is likely fruitful to develop such stochastic variability correction methods. For example, predictor selection methods from downscaling could be useful. Both fields require powerful and robust (time invariant) predictors. Multi-site statistical downscaling techniques aim at reproducing the auto- and cross-correlations between stations (Maraun et al., 2010), which may be interesting for homogenisation as well.

The daily temperature benchmarking study of Rachel Killick (2016) suggests that current daily correction methods are not able to improve the distribution much. There is a pressing need for more research on this topic. However, these methods likely also performed less well because they were used together with detection methods with a much lower hit rate than the comparison methods.

The deterministic correction methods may not lead to severe errors in homogenisation, that should still be studied, but stochastic methods that implement the corrections by adding noise would at least theoretically fit better to the problem. Such stochastic corrections are not trivial and should have the right variability on all temporal and spatial scales.

It should be studied whether it may be better to only detect the dates of break inhomogeneities and perform the analysis on the homogeneous subperiods (HSPs), removing the need for corrections. The disadvantage of this approach is that most of the trend variance is in the differences between the means of the HSPs and only a small part is in the trend within the HSPs. For trend analysis, this would be similar to the work of the Berkeley Earth Surface Temperature group on the mean temperature signal. Periods with gradual inhomogeneities, e.g., due to urbanisation, would have to be detected and excluded from such an analysis.

An outstanding problem is that current variability correction methods have only been developed for break inhomogeneities, methods for gradual ones are still missing. In homogenisation of the mean of annual and monthly data, gradual inhomogeneities are successfully removed by implementing multiple small breaks in the same direction. However, as daily data is used to study changes in the distribution, this may not be appropriate for daily data as it could produce larger deviations near the breaks. Furthermore, changing the variance in data with a trend can be problematic (Von Storch, 1999).

At the moment most daily correction methods correct the breaks one after another. In monthly homogenisation it is found that correcting all breaks simultaneously (Caussinus and Mestre, 2004) is more accurate (Domonkos et al., 2013). It is thus likely worthwhile to develop multiple breakpoint correction methods for daily data as well.
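The idea of joint correction can be sketched as a least-squares problem. The example below is a strong simplification of Caussinus and Mestre (2004): a single difference series, two break positions assumed known, and one indicator column per homogeneous subperiod, so that all segment levels, and hence all corrections, are estimated simultaneously rather than one after another.

```python
import numpy as np

rng = np.random.default_rng(2)

n = 120
breaks = [40, 80]                          # break positions, assumed known
true_levels = np.array([0.0, -0.5, 0.3])   # segment levels in deg C

segment = np.searchsorted(breaks, np.arange(n), side="right")
series = true_levels[segment] + rng.normal(0, 0.2, n)  # difference series

# Design matrix with one indicator column per homogeneous subperiod:
# solving the least-squares problem estimates all levels at once.
X = (segment[:, None] == np.arange(len(true_levels))).astype(float)
levels, *_ = np.linalg.lstsq(X, series, rcond=None)

# Corrections bring every earlier segment to the level of the last one.
corrections = levels[-1] - levels
print(np.round(levels, 2))  # close to [0.0, -0.5, 0.3]
```

In the real method the break positions are unknown and all stations of a network share a common regional signal, which makes the joint problem much harder than this sketch.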

Finally, current daily correction methods rely on previously detected breaks and assume that the homogeneous subperiods (HSPs) are indeed homogeneous (i.e., each segment between breakpoints is assumed to be homogeneous). However, these HSPs are currently based on detection of breaks in the mean only. Breaks in higher moments may thus still be present in the "homogeneous" subperiods and affect the corrections. If only for this reason, we should also work on detection of breaks in the distribution.

Correction as model selection problem

The number of degrees of freedom (DOF) of the various correction methods varies widely. From just one degree of freedom for annual corrections of the means, to 12 degrees of freedom for monthly correction of the means, to 40 for decile corrections applied to every season, to a large number of DOF for quantile or percentile matching.

A study using PRODIGE on the HOME benchmark suggested that for typical European networks monthly adjustments are best for temperature; annual corrections are probably less accurate because they fail to account for changes in the seasonal cycle due to inhomogeneities. For precipitation annual corrections were most accurate; monthly corrections were likely less accurate because the data was too noisy to estimate the 12 correction constants/degrees of freedom.

What the best correction method is depends on the characteristics of the inhomogeneity. For a calibration problem just the annual mean could be sufficient; for a serious exposure problem (e.g., insolation of the instrument) a seasonal cycle in the monthly corrections may be expected and the full distribution of the daily temperatures may need to be adjusted. The best correction method also depends on the reference: whether the parameters of a certain correction model can be reliably estimated depends on how well-correlated the neighbouring reference stations are.

An entire regional network is typically homogenised with the same correction method, while the optimal correction method will depend on the characteristics of each individual break and on the quality of the reference. These will vary from station to station, from break to break and from period to period. Work on correction methods that objectively select the optimal correction method, e.g., using an information criterion, would be valuable.
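A sketch of such an objective selection using an information criterion. The example is hypothetical: it chooses between an annual correction (one parameter) and monthly corrections (twelve parameters) for a toy difference series whose break has no seasonal cycle, so the simpler model should win.

```python
import numpy as np

rng = np.random.default_rng(3)

def bic(residuals, n_params):
    # Bayesian information criterion for a least-squares fit
    n = residuals.size
    return n * np.log(np.mean(residuals ** 2)) + n_params * np.log(n)

# Ten years of monthly differences (candidate minus reference) after a
# break: a pure mean offset of 0.5 deg C plus noise, no seasonal cycle.
month = np.tile(np.arange(12), 10)
diff = 0.5 + rng.normal(0, 0.3, month.size)

annual_resid = diff - diff.mean()                             # 1 parameter
monthly_fit = np.array([diff[month == m].mean() for m in range(12)])
monthly_resid = diff - monthly_fit[month]                     # 12 parameters

scores = {"annual": bic(annual_resid, 1), "monthly": bic(monthly_resid, 12)}
print(min(scores, key=scores.get))  # the 11 extra parameters are not warranted
```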

In case of (sub-)daily data, the options to select from become even larger. Daily data can be corrected just for inhomogeneities in the mean (e.g., Vincent et al., 2002, where daily temperatures are corrected by incorporating a linear interpolation scheme that preserves the previously defined monthly corrections) or also for the variability around the mean. In between are methods that adjust for the distribution including the seasonal cycle, which dominates the variability and is thus effectively similar to mean adjustments with a seasonal cycle. Correction methods of intermediate complexity with more than one, but less than 10 degrees of freedom would fill a gap and allow for more flexibility in selecting the optimal correction model.

When applying these methods (Della-Marta and Wanner, 2006; Wang et al., 2010; Mestre et al., 2011; Trewin, 2013), the number of quantile bins (categories) needs to be selected, as well as whether to use physical weather-dependent predictors and the functional form in which they are used (Auchmann and Brönnimann, 2012). Objective methods for making these selections optimally would be valuable.
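A minimal sketch of such a quantile correction with a selectable number of bins; the function and the toy bias are my own illustration, not the method of any of the cited papers.

```python
import numpy as np

rng = np.random.default_rng(4)

def quantile_match(before, after, values, n_bins=10):
    """Map `values`, distributed like `before`, onto the distribution of
    `after` using n_bins quantile bins (deciles for n_bins=10)."""
    qs = np.linspace(0, 1, n_bins + 1)
    return np.interp(values, np.quantile(before, qs), np.quantile(after, qs))

# Toy inhomogeneity: the earlier instrument read 0.5 deg C too high on
# warm days only, so only the upper part of the distribution is biased.
after = rng.normal(20, 5, 5000)       # reference segment (deg C)
before = after + 0.5 * (after > 22)   # biased earlier segment
adjusted = quantile_match(before, after, before)

print(round(before.mean() - after.mean(), 2))    # bias before adjustment
print(round(adjusted.mean() - after.mean(), 2))  # much reduced afterwards
```

The choice of n_bins is exactly the kind of parameter an objective selection method should set.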

Related information

WMO Guidelines on Homogenization (English, French, Spanish) 

WMO guidance report: Challenges in the Transition from Conventional to Automatic Meteorological Observing Networks for Long-term Climate Records

Sunday 30 August 2020

A primer on herd immunity for Social Darwinists

Herd immunity has been proposed as a way to deal with the new Corona virus. In the best case, it is a call to slow down the spread of the SARS-CoV-2 virus enough so that the hospitals can just handle the flood of patients; in the worst case it is a fancy term for doing nothing and letting everyone get sick.

Trump often talked about letting the pandemic wash over America. Insider reports confirm he was indeed talking about herd immunity. When I first heard claims that Boris Johnson was pursuing herd immunity, I assumed his political opponents were smearing him and trying to get him to act, but it seems as if this really was his plan. Of all world leaders Jair Bolsonaro may be most in denial about the pandemic, which he calls a little flu. He also advocated herd immunity. All these leaders have downplayed the threat, which by itself helps spread the disease, and advocated policies that promote infection, leading to more infected, sick and dead people.

America and Brazil lead the COVID-19 death rankings unchallenged, with 187 and 120 thousand total deaths respectively and around one thousand people dying every day over the last month. The UK has the most COVID-19 deaths in Europe, even though it was lucky to get the virus late.

This "strategy" has a certain popularity among Trump-like politicians. I do not think they know what they are doing. Scientific advice tends to come from a humanist perspective where every life is valued. Such advice is naturally rejected by Social Darwinists, who in the best case do not care about most people. While these politicians naturally see themselves as more valuable than us, they tend not to excel in academics. So let me explain why herd immunity is a bad policy from their perspective as well, even if up to now people from groups they hate had a higher risk of dying.

Herd immunity

If we do not take any preventative measures, one SARS-CoV-2 infected person infects two or three further people. This may not sound like much, but it is an example of exponential growth. It is the same situation as that of the craftsman who "only" asks the king for rice as payment for his chessboard: one grain on the first square, two on the second, four on the third square and so on.

If we assume that one person only infects two other people, that is that the base reproduction number is two, then the sequence is: 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, ..., 536,870,912, ...

Those are just 30 steps to get to half a billion, and each step takes about 5 days in the case of SARS-CoV-2. So that is half a year. With good care 1 to 2 percent of the infected die; at this point that would be 5 to 10 million people. Even if you are highly optimistic, it is still 2 million people.

Many more people need to go to the hospital. In Germany this is 17%. A recent French study reported that after 110 days most patients were still tired and had trouble breathing, and many had not yet returned to work. That would be around 200 million people with long-lasting health problems.

This will naturally not happen in reality. People will take action to reduce the reproduction number, whether the government mandates it or not. And at a certain moment an infected person will no longer infect as many people because many are already immune. If the base reproduction number is two and half the population is immune, an infected person will on average only infect one other person, that is, the effective reproduction number is one.

The actual base reproduction number is most likely larger than two and reality is more complicated, so experts estimate that the actual herd immunity level is not 50%, but between 60 and 70%. A further complication is that people may be sufficiently immune to avoid getting ill again, while the immunity does not prevent them from getting infected and transmitting the virus. There is a well-documented case of a 33-year-old man from Hong Kong who got infected twice, but did not get ill. If this were typical, herd immunity would not exist.
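The quoted numbers follow from a simple relation: infections decline once the effective reproduction number R_eff = R0 × (1 − immune fraction) drops below one, so the herd immunity threshold is 1 − 1/R0.

```python
# Herd immunity threshold as a function of the base reproduction number R0.
for r0 in (2.0, 2.5, 3.0):
    print(f"R0 = {r0}: herd immunity threshold {1 - 1 / r0:.0%}")
# R0 = 2.0: 50%, R0 = 2.5: 60%, R0 = 3.0: 67%
```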

You may have heard experts say that once this immunity level has been reached, the pandemic is over. But this does not mean that the virus is gone. Europe needed several months of an effective reproduction number well below one to get to low infection numbers (and the virus is still not gone). This was after a drastic decrease in the effective reproduction number (R) due to public health measures; in the case of herd immunity R would initially be around one and only very slowly go below one.

Say that we reach R = 1 when one million people are infected; then one step later (5 days) another one million people are infected. One million people is not much compared to the 30% of the world population that is still susceptible, so R will still be almost one. In other words, it would take several years for the virus to go away even in the best case. In the worst case, the virus mutates, people lose some immunity and new babies are immunologically naive. So most likely SARS-CoV-2 would stay with us forever.
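A toy simulation shows how slowly the epidemic fades once the threshold is reached. The numbers are illustrative: a base reproduction number of exactly two, five-day generation steps and a perfectly mixed world population.

```python
r0 = 2.0
population = 7_800_000_000
immune = population // 2      # start exactly at the 50% threshold
infected = 1_000_000
generations = 0
while infected > 10_000:      # run until infections are "low"
    r_eff = r0 * (1 - immune / population)
    immune += infected        # the infected become immune
    infected = int(infected * r_eff)
    generations += 1
print(f"roughly {generations * 5 / 365:.1f} years")
```

Under these toy assumptions it takes years, not weeks, for infections to fall from one million to ten thousand.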

Reaching herd immunity will not help Trump. He will still be bunkered down in the White House, surrounded by staff that is tested every day so as not to infect him, while he calls on others to go out, without a good testing system, and die for him. Trump's billionaire buddies will still need to lock themselves up in their mansions or high-sea yachts, counting how much richer they got from Trump's COVID-19 bailout. The millionaire hosts at Fox News will stay at home telling others to go out to work even if their workplace is not safe, and to accelerate the pandemic by sending kids to schools even if the schools are not safe. They will all still need to wait until there is a vaccine. The herd immunity strategy only ensures that up to that time the largest number of people have died.

When the virus is everywhere, good luck trying to keep it out of elderly homes. In 2016 Trump won in the age groups above 50. The UK Conservatives had a 47 point lead among those aged 65+. In Brazil Bolsonaro had a 16 percentage point lead among people older than 60. They will be the ones dying and seeing their friends die. This is not helpful for the popular support of far right politicians.

The elite may think that it will be the poorest 70% that get infected. Far right Republicans may hope that it will affect Democrats and people of color more. It is true that at the moment poor people are more affected as they cannot afford to stay at home even if their place of work is not safe. It is true that initially mostly blue states and cities were affected in America.

Let's take the theoretical case where the poorest 70% are infected or immune and the richest 30% are still immunologically naive. As soon as one of these 30% is infected, the virus will spread like wildfire: rich people tend to hang out with rich people, so it will easily find two or three rich people to infect next.

That is one reason why it is too simple to equate a base reproduction number of two with a herd immunity level of 50%. This would be the case if the population were perfectly mixed. But any network where the immunity level is not yet 50% is up for grabs. In the end everyone will get it, rich or poor, red or blue.

The only Social Darwinists for whom this pays are billionaires who have their own private hospital, with their own nurses and doctors at their mansion. They would have a 1 to 2 percent chance of dying. Whereas if they manage to convince the people to go for herd immunity, without even staying below the carrying capacity of the hospitals, around 5% of the population would die. That is a 3 to 4 percentage point survival difference. Not sure that is worth getting all your politicians kicked out of office.

It naturally also helps the high-frequency traders, like the Mercer family, who funded Trump in 2016 when no one thought he was a good investment. They have made so much money from the chaos Trump produces. Up or down, the high-frequency trader wins, and down goes faster. They live their lives on chaos, suffering and destruction. I presume they have a private hospital; they have the money.

But for the average Joe Social Darwinist there are nearly no gains and it is bad politics. It hurts your country compared to more social democratic countries and at home it helps lefties get into power and implement disgusting policies that help everyone.

Related reading

Nature: The false promise of herd immunity for COVID-19. Why proposals to largely let the virus run its course — embraced by Donald Trump’s administration and others — could bring “untold death and suffering”.

Sunday 26 July 2020

Micro-blogging for scientists without nasties and surveillance

Start screen picture of Mastodon: A Mastodon playing with paper airplanes.

Two years ago I joined Mastodon to get to know a more diverse group of people here in Bonn. Almost two thousand messages later, I can say I really like it there, and social networks like Mastodon are much more healthy for society as well. Together with Frank Sonntag I have recently set up a Mastodon server for publishing scientists. Let me explain how it works and why this system is better for the users and society.

Mastodon looks a lot like Twitter, i.e. it is a micro-blogging system, but many tweaks make it a much friendlier place where you can have meaningful conversations. One exemplary difference is that there are no quote tweets. Quoting rather than simply replying is often used by large accounts to bully small ones by pulling many people who disagree into the "conversation". I do miss quote tweets; they can also be used for good, to highlight what is interesting about a tweet or to explain something that the writer assumed their readers know, but your readers may not. But quote tweets make the atmosphere more adversarial, less about understanding and talking with each other. Conflict leads to more engagement and more time on the social network, so Twitter and Facebook like it, but pitting groups against each other is not the public debate that makes humanity better.

The main difference under the hood is that the system is not controlled by one corporation. There is not one server, but many servers that seamlessly talk with each other, just like the email system. The communication protocol (ActivityPub) is a standard of the World Wide Web Consortium, just like HTML and HTTPS, which powers the web.
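To give an impression of what travels between servers, here is a minimal ActivityPub "Create" activity wrapping a Note, written as a Python dictionary. The type names and addressing follow the ActivityStreams vocabulary, but the account is made up, and a real server would add further fields (an id, a published date, signatures).

```python
import json

activity = {
    "@context": "https://www.w3.org/ns/activitystreams",
    "type": "Create",                                # the actor creates...
    "actor": "https://fediscience.org/users/alice",  # hypothetical account
    "object": {
        "type": "Note",                              # ...a short message
        "content": "Hello, fediverse!",
        "to": ["https://www.w3.org/ns/activitystreams#Public"],
    },
}
print(json.dumps(activity, indent=2))
```

One server delivers such a JSON document to the inboxes of the followers' servers; any software that speaks the protocol can display the Note.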

This means that you can choose the server and interface you like and still talk to others, while people on Twitter, Facebook, Instagram, WordPress and Tumblr can only talk to other people in their silo. As they say, the modern internet is a group of five websites, each consisting of screenshots from the other four. It is hard to leave these silos, as it would cut you off from your friends. This is also why the system naturally evolves into a few major players. Their service is as bad as one would expect given the monopoly power this network effect gives them.

The Fediverse and its social networks as icons

ActivityPub is not only used by Mastodon, but also by other micro-blogging social networks such as Pleroma, blogging networks, podcasting services such as FunkWhale and file hosting such as NextCloud. There is a version of Instagram (PixelFed) and of YouTube (PeerTube). With ActivityPub all these social networks can talk to each other. Where they do different things, the system is designed to degrade gracefully. PixelFed shows photos more beautifully and has collections and filters, while Mastodon gracefully shows the recent photos below a message. PeerTube shows one large video on a page, just like YouTube, while Mastodon shows the newest videos as small previews below a message in the news feed. The full network is called the fediverse, a portmanteau of federation and universe.

Currently all these services are ad-free and tracking-free. The coding of the open source software is largely a labor of love, even if some coders are supported by micro-funding, for example Patreon or Liberapay. Most servers are maintained by people as a hobby, some (as with email) by organizations for their members, some larger ones again use Patreon or Liberapay, and some are even coops.

This means that technology enthusiasts from the middle class are mostly behind these networks. That is better than a few large ad corporations, but still not as democratic as one would like for such an important part of our society.


Not only can these networks talk to each other, they themselves also consist of many different servers, each maintained by another group, just like the email system. This means that moderation of the content is much better than on Twitter or Facebook. The owners of the servers want to create a functional community, and these communities are relatively small, so they can invest much more time per moderation decision than a commercial silo would. Also, if the moderation fails, people will go somewhere else.

Individual moderation decisions only pertain to one server and are thus less impactful and can consequently be more forceful. If you do not like the moderation, you can move to another server that fits your values better. If you are kicked off a server, you can go to another one and still talk to your friends. Facebook kicking someone off Facebook or Twitter kicking someone off Twitter is somewhat of a big deal and is thus only done in extreme cases, when someone has already created a lot of damage to the social fabric, while others make the atmosphere toxic staying below the radar.

If someone is really annoying they may naturally be removed from many servers. Then it does become a problem for this person, but that only happens when many server administrators agree you are not welcome. So maybe that person is really not an enrichment for humanity.

The extreme example would be Nazis. Some Nazis were too extreme for Twitter and started their own micro-blogging network. Probably most Nazis know the name already, but I think it is a good policy not to help bad actors with PR. As this network was used to coordinate their violent and inhumane actions, Google and Apple removed their apps from their app stores. I may like that outcome, but these corporations should not have that power. Next this network started using ActivityPub, so that they could use ActivityPub apps. The main ActivityPub network does not like Nazis, so its servers all blocked this network.

I feel this is a good solution for society, everyone has their freedom of speech, but Nazis cannot harass decent people. They can tell each other pretty lies, where being responsible for killing more than 138 thousand Americans is patriotism, but 4 is treason, where the state brutalizing people expressing their 1st amendment rights is freedom, but wearing a mask not to risk the lives of others is tyranny. At least we do not have to listen to the insanity. (The police should naturally listen to stop crime.)

Many of the societal problems of Facebook and Co. would be much reduced if we legislated that such large networks open up to competition by implementing open communication protocols like ActivityPub. Then they would be forced to deliver a good product to keep their customers. If they do not change, many will flee the repulsive violent conspiracy surveillance hell they were only still part of to be able to talk to grandma.

Because there are nearly no Nazis and other unfriendly characters, the fediverse is very popular with groups they would otherwise harass and bully into silence. It is a colorful bunch. This illustrates that extending the right to free speech to the right to be amplified by others does not optimize the freedom of speech, but in reality excludes many voices.

A short encore: the coders of the ActivityPub apps also do not like Nazis, so they hard-coded Nazi blocks into their apps. It is open source software, so the Nazis can remove this, but Google and Apple will not accept their apps. The latter is the societal problem; the coders are perfectly in their right not to want their work to be used to destroy civilization.

Open Science

The fediverse looks a lot like the Open Science tool universe I am dreaming of. Many independent groups and servers that seamlessly communicate with each other. The Grassroots post-publication peer review system I am working on should be able to gather reviews from all the other review and endorsement systems. They and repositories should be able to display grassroots reviews.

The reviews could be aided by displaying information on retractions from the Retraction Watch database. I hope someone will build a service that also warns when a cited article is retracted. The review could show or link to open citations of the article and statistics checks, as well as plagiarism and figure tampering checks.

We could have systems that warn authors of new articles and manuscripts they may find interesting given their publication history and warn editors of manuscripts that fit to their journal. I recently made a longer list of useful integrations and services and put it on Zenodo.

These could all be independent services that work together via ActivityPub and APIs, but the legacy publishers are working on collaborative science pipelines that create network effects, to ensure you are forced to use the largest service, where your colleagues are, and cannot leave, just like Facebook, Google and Twitter.


A mastodon with a paperplane in its trunk.
I am explaining all this to illustrate that such a federated social network is much better for society and its users. I really like the atmosphere on Mastodon. You can have real conversations with interesting people, without lunatics jumping in between or groups being pitted against each other. If people hear less and less of me on Twitter, that is one of the reasons.

So I hope that this kind of network is the future, and to help us get there we have started a Mastodon server for publishing scientists. "We" is me and former meteorologist Frank Sonntag, who leads a small digital services company, AKM-services. So for him setting up a Mastodon server was easy.

Two years ago he had to drag me to Mastodon a bit, when we tried to set up a server just for the Earth Sciences. That did not work out. By now I have learned to love Mastodon, it has gotten a lot bigger, and more people are aware of the societal problems due to social media. So it is time for another try with a larger target audience: all scientists. We have called it FediScience.

Mastodon is still quite small with about half a million active users; Twitter is 100 times bigger. My impression is that at least many climate scientists are on Twitter for science communication. For many, leaving Twitter is not yet a realistic option, but FediScience could be a friendly place to talk to colleagues and nerd out about detailed science, while staying on Twitter for more comprehensible tweets on the main findings.

Once we have a nice group together, we can together decide on the local rules. How we would like to moderate, who will do the moderation, with whom our server federates, who is welcome, how long the messages are, whether we want equations, ... In the end I hope the server will be run by an association with the users as members.

My network empire

My solution to Mastodon still being small was to stay on Twitter to talk about climate science, the political problems leading to the climate branch of the American culture war, and anything that comes up on this blog, Variable Variability. As the goal of my Mastodon account in Bonn is to build a local network for a digital non-profit, there I talk more about the open web and data privacy, often write in German and only occasionally write about climate. I aim to use my new account at FediScience to talk about (open) science and to finally enjoy a captive audience that understands the statistics of variability. As administrator I will try to help people find their way in the fediverse.

Next to this, the grassroots open review journals are on Mastodon, Twitter and Reddit. And I have inherited the Open Science Feed from Jon Tennant, which is on Mastodon, Twitter and Reddit. Both deserve to get an IndieWeb homepage and a newsletter, but all newsletters I know are full of trackers; suggestions for ethical ones are welcome. For even more fun, I also created a Twitter feed for the climate statistics blog Tamino and scientific skeptic Potholer54's YouTube channel. I should probably put them on Mastodon as well. That makes this blog my 12th social media channel. Pro-tip: with Firefox "containers" you can be logged in to multiple Mastodon, Twitter or Reddit accounts.

Every member of FediScience can invite their colleagues to join the network. Please do. If you share the link in public, please make it time limited.

Please let other scientists know about FediScience, whether by mail or via one of the social media silos. These are good Tweets to spread:



When you join Mastodon, the following glossary is helpful.

The Bird Site: Twitter
Fediverse: All federated social media sites together
Instance: Server running Mastodon
Toot: Tweet
Boost: Retweet
ActivityPub (AP): The main communication protocol in the fediverse
Content Warning (CW): A convenient way to give a heads up
Nitter: A mirror site of Twitter without tracking, popular for linking to in Mastodon


Friday 29 May 2020

What does statistical homogenization tell us about the underestimated global warming over land?

Climate station data contains inhomogeneities, which are detected and corrected by comparing a candidate station to its neighbouring reference stations. The most important inhomogeneities are the ones that lead to errors in the station network-wide trends and in global trend estimates. 

An earlier post in this series argued that statistical homogenization tends to under-correct errors in the network-wide trends of the raw data. Simply put: some of the trend error will remain. The catalyst for this series is the new finding that when the signal to noise ratio is too low, homogenization methods make large errors in the positions of the jumps/breaks. For much of the earlier data and for networks in poorer countries this probably means that any trend errors are seriously under-corrected, if they are corrected at all.
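To see why a low signal to noise ratio hurts, here is a toy sketch, not any operational homogenization method: a series of pure noise plus a single jump, and a simple least-squares estimator of the break position. All numbers are made up for illustration; since the noise has standard deviation one, the jump size equals the signal to noise ratio.

```python
import numpy as np

rng = np.random.default_rng(42)

def detect_break(series):
    """Estimate the position of a single break as the split that maximizes
    the weighted squared difference of the segment means (least squares)."""
    n = len(series)
    csum = np.cumsum(series)
    total = csum[-1]
    best_i, best_stat = 1, -np.inf
    for i in range(1, n):
        mean_left = csum[i - 1] / i
        mean_right = (total - csum[i - 1]) / (n - i)
        stat = i * (n - i) * (mean_left - mean_right) ** 2
        if stat > best_stat:
            best_stat, best_i = stat, i
    return best_i  # first index after the estimated break

def mean_position_error(snr, n=100, true_break=50, trials=200):
    """Average absolute error of the estimated break position; noise has
    standard deviation 1, so the jump size equals the signal to noise ratio."""
    errors = []
    for _ in range(trials):
        series = rng.standard_normal(n)
        series[true_break:] += snr  # insert one jump of size `snr`
        errors.append(abs(detect_break(series) - true_break))
    return float(np.mean(errors))

error_high_snr = mean_position_error(snr=3.0)
error_low_snr = mean_position_error(snr=0.5)
```

With a large jump the estimator lands on or next to the right year almost every time; with a jump of half the noise level the estimated position scatters over much of the series.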

The questions for this post are: 1) What do the corrections in global temperature datasets do to the global trend and 2) What can we learn from these adjustments for global warming estimates?

The global warming trend estimate

In the global temperature station datasets statistical homogenization leads to larger warming estimates. As we tend to underestimate how much correction is needed, this suggests that the Earth warmed more than current estimates indicate.

Below is the warming estimate in NOAA’s Global Historical Climate Network (versions 3 and 4) from Menne et al. (2018). You see the warming in the “raw data” (before homogenization; dashed lines) and in the homogenized data (solid lines). The new version 4 is drawn in black, the previous version 3 in red. For both versions homogenization makes the estimated warming larger.

After homogenization the warming estimates of the two versions are quite similar. The difference is in the raw data. Version 4 is based on the raw data of the International Surface Temperature Initiative and has many more stations. Version 3 had many stations that report automatically; these are typically professional stations and a considerable part of them are at airports. One reason the raw data may show less warming in version 3 is that many stations at airports were previously in cities. Taking them out of the urban heat island, and often also improving the local siting of the station, may have produced a systematic artificial cooling in the raw observations.

Version 4 has more stations and thus a higher signal to noise ratio. One may thus expect it to show more warming. That this is not the case is a first hint that the situation is not that simple, as explained at the end of this post.

The global land warming estimates based on the Global Historical Climate Network dataset of NOAA. The red lines are for version 3, the black lines for the new version 4. The dashed lines show the data before homogenization, the solid lines after homogenization. Figure from Menne et al. (2018).

The difference due to homogenization in the global warming estimates is shown in the figure below, also from Menne et al. (2018). The study also added an estimate for the data of the Berkeley Earth initiative.

(Background information. Berkeley Earth started as a US Culture War initiative where non-climatologists computed the observed global warming. Before the results were in, climate “sceptics” claimed their methods were the best and they would accept any outcome. The moment the results turned out to be scientifically correct, but not politically correct, the climate “sceptics” dropped them like a hot potato.)

We can read from the figure that over the full period homogenization increases the warming estimate by about 0.3 °C per century in GHCNv3, by 0.2 °C per century in GHCNv4 and by 0.1 °C per century in the Berkeley Earth dataset. GHCNv3 has more than 7,000 stations (Lawrimore et al., 2011). GHCNv4 is based on the ISTI dataset (Thorne et al., 2011), which has about 32,000 stations, but GHCN only uses stations with at least 10 years of data and thus contains about 26,000 stations (Menne et al., 2018). Berkeley Earth is based on 35,000 stations (Rohde et al., 2013).

The difference due to homogenization in the global warming estimates (Menne et al., 2018). The red line is for the smaller GHCNv3 dataset, the black line for GHCNv4 and the blue line for Berkeley Earth.

What does this mean for global warming estimates?

So, what can we learn from these adjustments for global warming estimates? At the moment, I am afraid, not a whole lot yet. However, the sign is quite likely right: if we could do a perfect homogenization, I expect this would make the warming estimates larger. But estimating how large the correction should have been, based on the corrections actually made in the above datasets, is difficult.

In the beginning I was thinking: if the signal to noise ratio in some network is too low, we may be able to estimate that in such a case we under-correct by, say, 50% and then make the adjustments unbiased by making them twice as large.
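As a back-of-the-envelope sketch of this scaling idea, with purely hypothetical numbers that are not estimates for any real dataset:

```python
# Purely hypothetical numbers, only to illustrate the scaling idea in the text.
applied_correction = 0.2          # °C per century actually added by homogenization
assumed_fraction_corrected = 0.5  # suppose homogenization caught only half the bias
estimated_full_correction = applied_correction / assumed_fraction_corrected
# Under these assumptions the unbiased correction would be 0.4 °C per century.
```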

However, especially doing this globally is a huge leap of faith.

The first assumption this would make is that the trend bias in data sparse regions and periods is the same as that of data rich regions and periods. However, the regions with high station density are in the mid-latitudes, where atmospheric measurements are relatively easy. The data sparse periods are also the periods in which large changes in the instrumentation were made, as we were still learning how to make good meteorological observations. So we cannot reliably extrapolate from data rich regions and periods to data sparse regions and periods.

Furthermore, there will not be one correction factor to account for under-correction, because the signal to noise ratio differs from region to region. Maybe America is only under-corrected by 10% and needs just a little nudge to make the trend correction unbiased. However, homogenization adjustments in data sparse regions may only be able to correct such a small part of the trend bias that correcting for the under-correction becomes adventurous, or may even make trend estimates more uncertain. So we would at least need to make such computations for many regions and periods.

Finally, another reason not to take such an estimate too seriously is the spatial and temporal character of the bias. The signal to noise ratio is not the only problem: one would expect that it also matters how the network-wide trend bias is distributed over the network. In the case of relocations of city stations to airports, a small number of stations will have a large jump. Such a large jump is relatively easy to detect, especially as its neighbouring stations will mostly be unaffected.

A harder case is the time of observation bias in America, where a large part of the stations shifted from afternoon to morning measurements over many decades, producing an artificial cooling. Here the neighbouring stations were in most cases not affected around the same time, but the smaller shift makes these breaks harder to detect.

(NOAA has a special correction for this problem, but when it is turned off statistical homogenization still finds the same network-wide trend. So for this kind of bias the network density in America is apparently sufficient.)

Among the hardest cases are changes in the instrumentation, for example the introduction of automatic weather stations in recent decades or the introduction of the Stevenson screen a century ago. These relatively small breaks often happen across an entire network within a few decades, if not years, which means that the neighbouring stations are affected as well. That makes them hard to detect in a difference time series.
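A toy sketch, with made-up numbers, of why near-simultaneous network-wide changes are so hard to see in the difference series that relative homogenization relies on: if the candidate and its reference experience the same shift only a few years apart, the shift largely cancels in the difference.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 120  # years

# Hypothetical network-wide transition (say, a new screen type) that reaches
# the candidate station in year 50 and its neighbour only five years later.
candidate_shift = np.where(np.arange(n) >= 50, -0.5, 0.0)
reference_shift = np.where(np.arange(n) >= 55, -0.5, 0.0)

candidate = candidate_shift + 0.3 * rng.standard_normal(n)
reference = reference_shift + 0.3 * rng.standard_normal(n)

# Relative homogenization looks at the difference series: the shared shift
# cancels everywhere except in the five years between the two transitions.
difference = candidate - reference
mean_before = difference[:50].mean()
mean_after = difference[55:].mean()

# The candidate series alone does contain the full -0.5 shift.
shift_in_candidate = candidate[55:].mean() - candidate[:50].mean()
```

The difference series barely changes between the two periods, while the candidate on its own clearly shifts; absolute homogenization would see it, but would also see every real climate fluctuation.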

Studying from the data how the biases are distributed is hard. One could study this by homogenizing the data and studying the breaks, but the ones which are difficult to detect will then be under-represented. This is a tough problem; please leave suggestions in the comments.

Because of how the biases are distributed, it is perfectly possible that the trend biases corrected in GHCN and Berkeley Earth are due to the easy-to-correct problems, such as the relocations to airports, while the hard ones, such as the transition to Stevenson screens, are hardly corrected. In that case, the corrections that could be made do not provide information on the ones that could not be made: they have different causes and different difficulties.

So even if we had a network where the signal to noise ratio is around one, we could not simply say that the under-correction is, say, 50%. One would have to specify for which kind of distribution of the bias this is valid.

GHCNv3, GHCNv4 and Berkeley Earth

Coming back to the trend estimates of GHCN version 3 and version 4: one may have expected that version 4, having more stations, is better able to correct trend biases and should thus show a larger trend than version 3. This goes even more so for Berkeley Earth. But the final trend estimates are quite similar. Similarly, the fewest corrections are made in the most data rich period, after the Second World War.

The datasets with the largest number of stations showing the strongest trend would have been a reasonable expectation if the raw data trends had been similar. But these raw data trends are the reason for the differences in the size of the corrections, while the trend estimates based on the homogenized data are quite similar.

Many of the additional stations will be in regions and periods where we already had many stations and where the station density was not a problem. On the other hand, adding some stations to data sparse regions may not be sufficient to fix the low signal to noise ratio. So most improvement would be expected for the moderate cases where the signal to noise ratio is around one. Until we have global estimates of the signal to noise ratio for these datasets, we do not know for which percentage of stations this is relevant, but it could be relatively small.

The arguments of the previous section also apply here; the relationship between station density and adjustments may not be that simple. It is especially suspicious that the corrections in the period after the Second World War are so small; we know quite a lot happened to the measurement networks. Maybe these effects all average out, but that would be quite a coincidence. Another possibility is that these changes in observational methods were made to entire networks over relatively short periods, making them hard to correct.

A reason for the similar outcomes for the homogenized data could be that all datasets successfully correct trend biases due to problems like the transition to airports, while for every dataset the signal to noise ratio is insufficient to correct problems like the transition to Stevenson screens. GHCNv4 and Berkeley Earth, which use as many stations as they could find, could well contain more badly sited stations than the more selective GHCNv3. In that case the smaller effective corrections of these two datasets would be due to compensating errors.

Finally, a small disclaimer: the main change from version 3 to 4 was the number of stations, but there were other small changes, so it is not a pure comparison of two datasets where only the signal to noise ratio differs. Such a pure comparison still needs to be made. The homogenization methods of GHCN and Berkeley Earth differ even more.

My apologies for all the maybes and could-bes, but this is more complicated than it may look, and I would not be surprised if it turns out to be impossible to estimate how much correction is needed from the corrections that homogenization algorithms actually make. The only thing I am confident about is that homogenization improves trend estimates; I am just not confident about how much.

Parallel measurements

Another way to study these biases in the warming estimates is to go into the books and study the station histories of 200 plus countries. This is basically how sea surface temperature records are homogenized. Doing this for land stations is a much larger project due to the large number of countries and languages.

Still, there are parallel measurement experiments, in which old and new measurement set-ups are operated side by side. These give a first estimate for some of the biases when it comes to the global mean temperature (do not expect regional detail). In the next post I will try to estimate the missing warming this way. We do not have much data from such experiments yet, but I expect this to be the future.

Other posts in this series


Chimani, Barbara, Victor Venema, Annemarie Lexer, Konrad Andre, Ingeborg Auer and Johanna Nemec, 2018: Inter-comparison of methods to homogenize daily relative humidity. International Journal of Climatology, 38, pp. 3106–3122.

Gubler, Stefanie, Stefan Hunziker, Michael Begert, Mischa Croci-Maspoli, Thomas Konzelmann, Stefan Brönnimann, Cornelia Schwierz, Clara Oria and Gabriela Rosas, 2017: The influence of station density on climate data homogenization. International Journal of Climatology, 37, pp. 4670–4683.

Lawrimore, Jay H., Matthew J. Menne, Byron E. Gleason, Claude N. Williams, David B. Wuertz, Russel S. Vose and Jared Rennie, 2011: An overview of the Global Historical Climatology Network monthly mean temperature data set, version 3. Journal of Geophysical Research, 116, D19121.

Lindau, Ralf and Victor Venema, 2018: On the reduction of trend errors by the ANOVA joint correction scheme used in homogenization of climate station records. International Journal of Climatology, 38, pp. 5255–5271.

Menne, Matthew J., Claude N. Williams, Byron E. Gleason, J. Jared Rennie and Jay H. Lawrimore, 2018: The Global Historical Climatology Network Monthly Temperature Dataset, Version 4. Journal of Climate, 31, pp. 9835–9854.

Rohde, Robert, Richard A. Muller, Robert Jacobsen, Elizabeth Muller, Saul Perlmutter, Arthur Rosenfeld, Jonathan Wurtele, Donald Groom and Charlotte Wickham, 2013: A New Estimate of the Average Earth Surface Land Temperature Spanning 1753 to 2011. Geoinformatics & Geostatistics: An Overview, 1, no.1.

Sutton, Rowan, Buwen Dong and Jonathan Gregory, 2007: Land/sea warming ratio in response to climate change: IPCC AR4 model results and comparison with observations. Geophysical Research Letters, 34, L02701.

Thorne, Peter W., Kate M. Willett, Rob J. Allan, Stephan Bojinski, John R. Christy, Nigel Fox, Simon Gilbert, Ian Jolliffe, John J. Kennedy, Elizabeth Kent, Albert Klein Tank, Jay Lawrimore, David E. Parker, Nick Rayner, Adrian Simmons, Lianchun Song, Peter A. Stott and Blair Trewin, 2011: Guiding the creation of a comprehensive surface temperature resource for twenty-first century climate science. Bulletin American Meteorological Society, 92, ES40–ES47.

Wallace, Craig and Manoj Joshi, 2018: Comparison of land–ocean warming ratios in updated observed records and CMIP5 climate models. Environmental Research Letters, 13, no. 114011. 

Williams, Claude, Matthew Menne and Peter Thorne, 2012: Benchmarking the performance of pairwise homogenization of surface temperatures in the United States. Journal Geophysical Research, 117, D05116.