Sunday, 26 July 2020

Micro-blogging for scientists without nasties and surveillance

Start screen picture of Mastodon: A Mastodon playing with paper airplanes.

Two years ago I joined Mastodon to get to know a more diverse group of people here in Bonn. Almost two thousand messages later, I can say I really like it there and that social networks like Mastodon are much healthier for society as well. Together with Frank Sonntag, I have recently set up a Mastodon server for publishing scientists. Let me explain how it works and why this system is better for its users and for society.

Mastodon looks a lot like Twitter, i.e. it is a micro-blogging system, but many tweaks make it a much friendlier place where you can have meaningful conversations. One exemplary difference is that there are no quote tweets. Quoting rather than simply replying is often used by large accounts to bully small ones by pulling many people who disagree into the "conversation". I do miss quote tweets. They can also be used for good: to highlight what is interesting about a tweet or to explain something that the writer assumed their readers know, but your readers may not know. But quote tweets make the atmosphere more adversarial, less about understanding and talking with each other. Conflict leads to more engagement and more time on the social network, so Twitter and Facebook like it, but pitting groups against each other is not the public debate that makes humanity better.

The main difference under the hood is that the system is not controlled by one corporation. There is not one server, but many servers that seamlessly talk with each other, just like the email system. The communication protocol (ActivityPub) is a standard of the World Wide Web Consortium, just like HTML and HTTPS, which powers the web.

This means that you can choose the server and interface you like and still talk to others, while people on Twitter, Facebook, Instagram, WordPress and Tumblr can only talk to other people in their silo. As they say, the modern internet is a group of five websites, each consisting of screenshots from the other four. It is hard to leave these silos; it would cut you off from your friends. This is also why the system naturally evolves into a few major players. Their service is as bad as one would expect with the monopoly power this network effect gives them.

The Fediverse and its social networks as icons

ActivityPub is not only used by Mastodon, but also by other micro-blogging social networks such as Pleroma, by blogging networks, by podcasting services such as FunkWhale and by file hosting such as NextCloud. There is a version of Instagram (PixelFed) and of YouTube (PeerTube). With ActivityPub all these social networks can talk to each other. Where they do different things, the system is designed to degrade gracefully. PixelFed shows photos more beautifully and has collections and filters, but Mastodon gracefully shows the recent photos as a photo below a message. PeerTube shows one large video on a page; just like Twitter, Mastodon shows the newest videos in a small format below a message in the news feed. The full network is called the fediverse, a portmanteau of federation and universe.
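To make the idea of graceful degradation a bit more concrete, here is a minimal sketch in Python. It builds two objects with type names from the ActivityStreams vocabulary that ActivityPub is based on (`Note` and `Video`) and renders them in a feed. The function names `make_object` and `render_in_feed`, the example URLs and the fallback layout are my own assumptions for illustration; this is not how Mastodon is actually implemented.

```python
import json

# ActivityStreams context URL used by ActivityPub objects.
AS_CONTEXT = "https://www.w3.org/ns/activitystreams"

def make_object(obj_type, name, url, content=""):
    """Build a minimal ActivityStreams-style object as a plain dict."""
    return {
        "@context": AS_CONTEXT,
        "type": obj_type,
        "name": name,
        "url": url,
        "content": content,
    }

def render_in_feed(obj):
    """Hypothetical micro-blog renderer: show notes natively and
    degrade gracefully to a titled link for every other type."""
    if obj["type"] == "Note":
        return obj["content"]
    # Richer or unknown types (Video, Image, ...) are not dropped,
    # they are shown in a simpler form.
    return f"[{obj['type']}] {obj['name']}: {obj['url']}"

# Illustrative objects; the example.org URLs are placeholders.
note = make_object("Note", "", "https://example.org/notes/1", "Hello fediverse")
video = make_object("Video", "Lecture on homogenization", "https://example.org/videos/7")

for item in (note, video):
    print(render_in_feed(item))

# The same objects travel between servers as plain JSON.
wire_format = json.dumps(video)
```

The design point is that the receiving server never has to understand every feature of the sending service; it only needs a sensible fallback for types it does not display natively.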

Currently all these services are ad-free and tracking-free. The coding of the open source software is largely a labor of love, even if some coders are supported by micro-funding, for example via Patreon or Liberapay. Most servers are maintained by people as a hobby, some (as with email) by organizations for their members, some larger ones again use Patreon or Liberapay, and some are even coops.

This means that technology enthusiasts from the middle class are mostly behind these networks. That is better than a few large ad corporations, but still not as democratic as one would like for such an important part of our society.


Not only can these networks talk to each other, they also themselves consist of many different servers each maintained by another group, just as the email system. This means that moderation of the content is much better than on Twitter or Facebook. The owners of the servers want to create a functional community, while these communities are relatively small. So they can invest much more time per moderation decision than a commercial silo would. Also if the moderation fails, people will go somewhere else.

Individual moderation decisions pertain to only one server and are thus less impactful and can consequently be more forceful. If you do not like the moderation, you can move to another server that fits your values better. If you are kicked off a server, you can go to another one and still talk to your friends. Facebook kicking someone off Facebook or Twitter kicking someone off Twitter is somewhat of a big deal and is thus only done in extreme cases, when someone has already done a lot of damage to the social fabric, while others make the atmosphere toxic while staying below the radar.
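A toy model may help to see why a moderation decision on one server is less impactful than one in a silo. The sketch below is purely illustrative: the `Server` class, the domain names and the account name are hypothetical and do not correspond to any real Mastodon API.

```python
class Server:
    """Toy model of one fediverse server with its own moderation."""

    def __init__(self, domain):
        self.domain = domain
        self.blocked_accounts = set()  # local moderation decisions only
        self.timeline = []

    def block(self, account):
        """Record a moderation decision that applies to this server alone."""
        self.blocked_accounts.add(account)

    def receive(self, author, text):
        """Accept a post unless this server's moderators blocked the author."""
        if author in self.blocked_accounts:
            return False
        self.timeline.append((author, text))
        return True

bonn = Server("bonn.example")
science = Server("science.example")

# Only the moderators of bonn.example decide to block this account.
bonn.block("annoying@elsewhere.example")

delivered = [s.receive("annoying@elsewhere.example", "hi") for s in (bonn, science)]
print(delivered)  # [False, True]
```

The account only disappears from the whole network when many independent administrators make the same decision.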

If someone is really annoying they may naturally be removed from many servers. Then it does become a problem for this person, but that only happens when many server administrators agree they are not welcome. So maybe that person is really not an enrichment for humanity.

The extreme example would be Nazis. Some Nazis were too extreme for Twitter and started their own micro-blogging network. Probably most Nazis know the name already, but I think it is a good policy not to help bad actors with PR. As this network was used to coordinate their violent and inhumane actions, Google and Apple have removed their apps from their app stores. I may like that outcome, but these corporations should not have that power. Next this network started using ActivityPub, so that they can use ActivityPub apps. The servers of the main ActivityPub network do not like Nazis, so they all blocked this network.

I feel this is a good solution for society, everyone has their freedom of speech, but Nazis cannot harass decent people. They can tell each other pretty lies, where being responsible for killing more than 138 thousand Americans is patriotism, but 4 is treason, where the state brutalizing people expressing their 1st amendment rights is freedom, but wearing a mask not to risk the lives of others is tyranny. At least we do not have to listen to the insanity. (The police should naturally listen to stop crime.)

Many of the societal problems of Facebook and Co. would be much reduced if we legislated that such large networks open up to competition by implementing open communication protocols like ActivityPub. Then they would be forced to deliver a good product to keep their customers. If they do not change, many will flee the repulsive violent conspiracy surveillance hell they were only still part of to be able to talk to grandma.

Because there are nearly no Nazis and other unfriendly characters, the fediverse is very popular with groups they would otherwise harass and bully into silence. It is a colorful bunch. This illustrates that extending the right to free speech to the right to be amplified by others does not optimize the freedom of speech, but in reality excludes many voices.

A short encore: the coders of the ActivityPub apps also do not like Nazis. So they hard coded Nazi blocks into their apps. It is open source software, so the Nazis can remove this, but Google and Apple will not accept their apps. The latter is the societal problem; the coders are perfectly within their rights not to want their work to be used to destroy civilization.

Open Science

The fediverse looks a lot like the Open Science tool universe I am dreaming of. Many independent groups and servers that seamlessly communicate with each other. The Grassroots post-publication peer review system I am working on should be able to gather reviews from all the other review and endorsement systems. They and repositories should be able to display grassroots reviews.

The reviews could be aided by displaying information on retractions from the Retraction Watch database. I hope someone will build a service that also warns when a cited article is retracted. The review could show or link to open citations of the article and statistics checks, as well as plagiarism and figure tampering checks.

We could have systems that alert authors to new articles and manuscripts they may find interesting given their publication history and alert editors to manuscripts that fit their journal. I recently made a longer list of useful integrations and services and put it on Zenodo.

These could all be independent services that work together via ActivityPub and APIs, but the legacy publishers are working on collaborative science pipelines that create network effects, to ensure you are forced to use the largest service where your colleagues are and cannot leave, just like Facebook, Google and Twitter.


A mastodon with a paperplane in its trunk.
I am explaining all this to illustrate that such a federated social network is much better for society and its users. I really like the atmosphere on Mastodon. You can have real conversations with interesting people, without lunatics jumping in between or groups being pitted against each other. If people hear less and less of me on Twitter, that is one of the reasons.

So I hope that this kind of network is the future and to help get there we have started a Mastodon server for publishing scientists. "We" is me and former meteorologist Frank Sonntag, who leads a small digital services company, AKM-services. So for him setting up a Mastodon server was easy.

Two years ago he had to drag me to Mastodon a bit, when we tried to set up a server just for the Earth Sciences. That did not work out. Now that I have learned to love Mastodon, it has gotten a lot bigger and more people are aware of the societal problems due to social media. So it is time for another try with a larger target audience, all scientists. We have called it: FediScience.

Mastodon is still quite small with about half a million active users; Twitter is 100 times bigger. My impression is that at least many climate scientists are on Twitter for science communication. For many leaving Twitter is not yet a realistic option, but FediScience could be a friendly place to talk to colleagues, nerd out about detailed science, while staying on Twitter for more comprehensible Tweets on the main findings.

Once we have a nice group together, say after the boreal summer holidays, we can together decide on the local rules. How we would like to moderate, who will do the moderation, with whom our server federates, who is welcome, how long the messages are, whether we want equations, ...

My network empire

My solution to Mastodon still being small was to stay on Twitter to talk about climate science, the political problems leading to the climate branch of the American culture war and anything that comes up on this blog: Variable Variability. As the goal of my Mastodon account in Bonn is to build a local network for a digital non-profit, there I talk more about the open web and data privacy, often write in German and only occasionally write about climate. I aim to use my new account at FediScience to talk about (open) science and to finally enjoy a captive audience that understands the statistics of variability. As administrator I will try to help people find their way in the fediverse.

Next to this, the grassroots open review journals are on Mastodon, Twitter and Reddit. And I have inherited the Open Science Feed from Jon Tennant, which is on Mastodon, Twitter and Reddit. Both deserve to get an IndieWeb homepage and a newsletter, but all newsletters I know are full of trackers; suggestions for ethical ones are welcome. For even more fun, I also created a Twitter feed for the climate statistics blog Tamino and scientific skeptic Potholer54's YouTube channel. I should probably put them on Mastodon as well. That makes this blog my 12th social media channel. Pro-tip: with Firefox "containers" you can be logged into multiple Mastodon, Twitter or Reddit accounts.

Every member of FediScience can invite their colleagues to join the network. Please do. This is my invitation link for you. (If you share the link in public, please make it time limited.)

Please let other scientists know about FediScience, whether by mail or via one of the social media silos. These are good Tweets to spread:


When you join Mastodon, the following glossary is helpful.

The Bird Site: Twitter
Fediverse: All federated social media sites together
Instance: Server running Mastodon
Toot: Tweet
Boost: Retweet
ActivityPub (AP): The main communication protocol in the fediverse
Content Warning (CW): A convenient way to give a heads up
A mirror site of Twitter without tracking, popular for linking to in Mastodon


Friday, 29 May 2020

What does statistical homogenization tell us about the underestimated global warming over land?

Climate station data contains inhomogeneities, which are detected and corrected by comparing a candidate station to its neighbouring reference stations. The most important inhomogeneities are the ones that lead to errors in the station network-wide trends and in global trend estimates. 

An earlier post in this series argued that statistical homogenization will tend to under-correct errors in the network-wide trends in the raw data. Simply put: that some of the trend error will remain. The catalyst for this series is the new finding that when the signal to noise ratio is too low, homogenization methods will have large errors in the positions of the jumps/breaks. For much of the earlier data and for networks in poorer countries this probably means that any trend errors will be seriously under-corrected, if they are corrected at all.

The questions for this post are: 1) What do the corrections in global temperature datasets do to the global trend and 2) What can we learn from these adjustments for global warming estimates?

The global warming trend estimate

In the global temperature station datasets statistical homogenization leads to larger warming estimates. So as we tend to underestimate how much correction is needed, this suggests that the Earth warmed up more than current estimates indicate.

Below is the warming estimate in NOAA’s Global Historical Climatology Network (Versions 3 and 4) from Menne et al. (2018). You see the warming in the “raw data” (before homogenization; striped lines) and in the homogenized data (drawn line). The new version 4 is drawn in black, the previous version 3 in red. For both versions homogenization makes the estimated warming larger.

After homogenization the warming estimates of the two versions are quite similar. The difference is in the raw data. Version 4 is based on the raw data of the International Surface Temperature Initiative and has many more stations. Version 3 had many stations that report automatically; these are typically professional stations and a considerable part of them are at airports. One reason the raw data may show less warming in Version 3 is that many stations at airports were in cities before. Taking them out of the urban heat island, and often also improving the local siting of the station, may have produced a systematic artificial cooling in the raw observations.

Version 4 has more stations and thus a higher signal to noise ratio. One may thus expect it to show more warming. That this is not the case is a first hint that the situation is not that simple, as explained at the end of this post.

Figure from Menne et al. with warming estimates from 1880. See caption below.
The global land warming estimates based on the Global Historical Climatology Network dataset of NOAA. The red lines are for version 3, the black lines for the new version 4. The striped lines are before homogenization and the drawn lines after homogenization. Figure from Menne et al. (2018).

The difference due to homogenization in the global warming estimates is shown in the figure below, also from Menne et al. (2018). The study also added an estimate for the data of the Berkeley Earth initiative.

(Background information. Berkeley Earth started as a US Culture War initiative where non-climatologists computed the observed global warming. Before the results were in, climate “sceptics” claimed their methods were the best and they would accept any outcome. The moment the results turned out to be scientifically correct, but not politically correct, the climate “sceptics” dropped them like a hot potato.)

We can read from the figure that in GHCNv3 over the full period homogenization increases warming estimates by about 0.3 °C per century, while this is 0.2 °C in GHCNv4 and 0.1 °C in the Berkeley Earth dataset. GHCNv3 has more than 7000 stations (Lawrimore et al., 2011). GHCNv4 is based on the ISTI dataset (Thorne et al., 2011), which has about 32,000 stations, but GHCN only uses those of at least 10 years and thus contains about 26,000 stations (Menne et al., 2018). Berkeley Earth is based on 35,000 stations (Rohde et al., 2013).

Figure from Menne et al. (2018) showing how much adjustments were made.
The difference due to homogenization in the global warming estimates (Menne et al., 2018). The red line is for the smaller GHCNv3 dataset, the black line for GHCNv4 and the blue line for Berkeley Earth.

What does this mean for global warming estimates?

So, what can we learn from these adjustments for global warming estimates? At the moment, I am afraid, not yet a whole lot. However, the sign is quite likely right. If we could do a perfect homogenization, I expect that this would make the warming estimates larger. But to estimate how large the correction should have been based on the corrections which were actually made in the above datasets is difficult.

In the beginning, I was thinking: if the signal to noise ratio in some network is too low, we may be able to estimate that in such a case we under-correct, say, 50% and then make the adjustments unbiased by making them, say, twice as large.
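This naive idea can be written down in a few lines. The sketch below is only an illustration of the arithmetic; the function name `naive_scaled_adjustment` and the input of 0.2 °C per century are my own illustrative choices, and the single global correction fraction is exactly the assumption that the rest of this section questions.

```python
def naive_scaled_adjustment(applied_adjustment, fraction_corrected):
    """Scale an applied trend adjustment up to the full trend bias,
    assuming a known fraction of the bias was actually corrected."""
    if not 0.0 < fraction_corrected <= 1.0:
        raise ValueError("fraction_corrected must be in (0, 1]")
    return applied_adjustment / fraction_corrected

# If only 50% of the bias is corrected, the adjustment would be doubled.
# The 0.2 (°C per century) is an illustrative input, not a result.
print(naive_scaled_adjustment(0.2, 0.5))
```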

However, especially doing this globally is a huge leap of faith.

The first assumption this would make is that the trend bias in data sparse regions and periods is the same as that of data rich regions and periods. However, the regions with high station density are in the mid-latitudes where atmospheric measurements are relatively easy. The data sparse periods are also the periods in which large changes in the instrumentation were made as we were still learning how to make good meteorological observations. So we cannot reliably extrapolate from data rich regions and periods to data sparse regions and periods.

Furthermore, there will not be one correction factor to account for under-correction because the signal to noise ratio is different everywhere. Maybe America is only under-corrected by 10% and needs just a little nudge to make the trend correction unbiased. However, homogenization adjustments in data sparse regions may only be able to correct such a small part of the trend bias that correcting for the under-correction becomes adventurous or even will make trend estimates more uncertain. So we would at least need to make such computations for many regions and periods.

Finally, another reason not to take such an estimate too seriously is the spatial and temporal characteristics of the bias. The signal to noise ratio is not the only problem. One would expect that it also matters how the network-wide trend bias is distributed over the network. In case of relocations of city stations to airports, a small number of stations will have a large jump. Such a large jump is relatively easy to detect, especially as its neighbouring stations will mostly be unaffected.

Already a harder case is the time of observation bias in America, where a large part of the stations has experienced a cooling shift from afternoon to morning measurements over many decades. Here, in most cases the neighbouring stations were not affected around the same time, but the smaller shift makes it harder to detect these breaks.

(NOAA has a special correction for this problem, but when it is turned off statistical homogenization still finds the same network-wide trend. So for this kind of bias the network density in America is apparently sufficient.)

Among the hardest cases are changes in the instrumentation. For example, the introduction of Automatic Weather Stations in the last decades or the introduction of the Stevenson screen a century ago. These relatively small breaks often happen over a period of only a few decades, if not years, which means that the neighbouring stations are affected as well. That makes it hard to detect them in a difference time series.
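Why simultaneous changes are so hard to see can be illustrated with a small simulation. This is a minimal sketch under my own assumptions (one candidate, one neighbour, a shared regional signal, Gaussian noise, a break of 0.5 °C at a known position), not the detection method of any of the datasets discussed here. It compares the jump visible in the candidate-minus-neighbour difference series when only the candidate has a break with the case where both stations change at the same time.

```python
import random

def jump_at(series, position):
    """Mean after the position minus mean before it."""
    before, after = series[:position], series[position:]
    return sum(after) / len(after) - sum(before) / len(before)

def simulate(break_in_neighbour, n=100, position=50, size=0.5, seed=1):
    """Return the jump seen in the candidate-minus-neighbour series."""
    rng = random.Random(seed)
    # Regional climate signal shared by both stations (trend plus variability).
    regional = [0.01 * t + rng.gauss(0, 0.5) for t in range(n)]
    candidate, neighbour = [], []
    for t in range(n):
        shift = size if t >= position else 0.0
        candidate.append(regional[t] + shift + rng.gauss(0, 0.1))
        neighbour_shift = shift if break_in_neighbour else 0.0
        neighbour.append(regional[t] + neighbour_shift + rng.gauss(0, 0.1))
    # The difference series removes the shared regional signal.
    difference = [c - r for c, r in zip(candidate, neighbour)]
    return jump_at(difference, position)

print(simulate(break_in_neighbour=False))  # close to the inserted 0.5
print(simulate(break_in_neighbour=True))   # close to zero
```

In the second case the break is present in both stations, so it cancels in the difference series together with the regional climate signal, and nothing is left to detect.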

Studying from the data how the biases are distributed is hard. One could study this by homogenizing the data and studying the breaks, but the ones which are difficult to detect will then be under-represented. This is a tough problem; please leave suggestions in the comments.

Because of how the biases are distributed it is perfectly possible that the trend biases corrected in GHCN and Berkeley Earth are due to the easy-to-correct problems, such as the relocations to airports, while the hard ones, such as the transition to Stevenson screens, are hardly corrected. In this case, the corrections that could be made do not provide information on the ones that could not be made. They have different causes and different difficulties.

So if we had a network where the signal to noise ratio is around one, we could not say that the under-correction is, say, 50%. One would have to specify for which kind of distribution of the bias this is valid.

GHCNv3, GHCNv4 and Berkeley Earth

Coming back to the trend estimates of GHCN version 3 and version 4: one may have expected that version 4 is better able to correct trend biases, having more stations, and should thus show a larger trend than version 3. This would hold even more for Berkeley Earth. But the final trend estimates are quite similar. Similarly, in the most data rich period after the second world war, the fewest corrections are made.

The datasets with the largest number of stations showing the strongest trend would have been a reasonable expectation if the trend estimates of the raw data had been similar. But these raw data trends are the reason for the differences in the size of the corrections, while the trend estimates based on the homogenized data are quite similar.

Many additional stations will be in regions and periods where we already had many stations and where the station density was no problem. On the other hand, adding some stations to data sparse regions may not be sufficient to fix the low signal to noise ratio. So the most improvements would be expected for the moderate cases where the signal to noise ratio is around one. Until we have global estimates of the signal to noise ratio for these datasets, we do not know for which percentage of stations this is relevant, but this could be relatively small.

The arguments of the previous section are also applicable here; the relationship between station density and adjustments may not be that easy. Especially that the corrections in the period after the second world war are so small is suspicious; we know quite a lot happened to the measurement networks. Maybe these effects all average out, but that would be quite a coincidence. Another possibility is that these changes in observational methods were made over relatively short periods to entire networks making it hard to correct them.

A reason for the similar outcomes for the homogenized data could be that all datasets successfully correct for trend biases due to problems like the transition to airports, while for every dataset the signal to noise ratio is not enough to correct problems like the transition to Stevenson screens. GHCNv4 and Berkeley Earth, using as many stations as they could find, could well have more stations which are currently badly sited than GHCNv3, which was more selective. In that case the smaller effective corrections of these two datasets would be due to compensating errors.

Finally, a small disclaimer: The main change from version 3 to 4 was the number of stations, but there were other small changes, so it is not just a comparison of two datasets where only the signal to noise ratio is different. Such a pure comparison still needs to be made. The homogenization methods of GHCN and Berkeley Earth are even more different.

My apologies for all the maybe's and could be's, but this is more complicated than it may look and I would not be surprised if it turns out to be impossible to estimate how much correction is needed based on the corrections that are made by homogenization algorithms. The only thing I am confident about is that homogenization improves trend estimates, but I am not confident about how much it improves them.

Parallel measurements

Another way to study these biases in the warming estimates is to go into the books and study station histories in 200 plus countries. This is basically how sea surface temperature records are homogenized. To do this for land stations is a much larger project due to the large number of countries and languages.

Still there are such experiments, which give a first estimate for some of the biases when it comes to the global mean temperature (do not expect regional detail). In the next post I will try to estimate the missing warming this way. We do not have much data from such experiments yet, but I expect that this will be the future.



Chimani, Barbara, Victor Venema, Annemarie Lexer, Konrad Andre, Ingeborg Auer and Johanna Nemec, 2018: Inter-comparison of methods to homogenize daily relative humidity. International Journal of Climatology, 38, pp. 3106–3122.

Gubler, Stefanie, Stefan Hunziker, Michael Begert, Mischa Croci-Maspoli, Thomas Konzelmann, Stefan Brönnimann, Cornelia Schwierz, Clara Oria and Gabriela Rosas, 2017: The influence of station density on climate data homogenization. International Journal of Climatology, 37, pp. 4670–4683.

Lawrimore, Jay H., Matthew J. Menne, Byron E. Gleason, Claude N. Williams, David B. Wuertz, Russell S. Vose and Jared Rennie, 2011: An overview of the Global Historical Climatology Network monthly mean temperature data set, version 3. Journal of Geophysical Research, 116, D19121.

Lindau, Ralf and Victor Venema, 2018: On the reduction of trend errors by the ANOVA joint correction scheme used in homogenization of climate station records. International Journal of Climatology, 38, pp. 5255–5271.

Rohde, Robert, Richard A. Muller, Robert Jacobsen, Elizabeth Muller, Saul Perlmutter, Arthur Rosenfeld, Jonathan Wurtele, Donald Groom and Charlotte Wickham, 2013: A New Estimate of the Average Earth Surface Land Temperature Spanning 1753 to 2011. Geoinformatics & Geostatistics: An Overview, 1, no.1.

Sutton, Rowan, Buwen Dong and Jonathan Gregory, 2007: Land/sea warming ratio in response to climate change: IPCC AR4 model results and comparison with observations. Geophysical Research Letters, 34, L02701.

Thorne, Peter W., Kate M. Willett, Rob J. Allan, Stephan Bojinski, John R. Christy, Nigel Fox, Simon Gilbert, Ian Jolliffe, John J. Kennedy, Elizabeth Kent, Albert Klein Tank, Jay Lawrimore, David E. Parker, Nick Rayner, Adrian Simmons, Lianchun Song, Peter A. Stott and Blair Trewin, 2011: Guiding the creation of a comprehensive surface temperature resource for twenty-first century climate science. Bulletin of the American Meteorological Society, 92, ES40–ES47.

Wallace, Craig and Manoj Joshi, 2018: Comparison of land–ocean warming ratios in updated observed records and CMIP5 climate models. Environmental Research Letters, 13, no. 114011. 

Williams, Claude, Matthew Menne and Peter Thorne, 2012: Benchmarking the performance of pairwise homogenization of surface temperatures in the United States. Journal of Geophysical Research, 117, D05116.

Friday, 1 May 2020

Statistical homogenization under-corrects any station network-wide trend biases

Photo of a station of the US Climate Reference Network with a prominent wind shield for the rain gauges.
A station of the US Climate Reference Network.

In the last blog post I made the argument that the statistical detection of breaks in climate station data has problems when the noise is larger than the break signal. The post before argued that the best homogenization correction method we have can remove network-wide trend biases perfectly if all breaks are known. In the light of the last post, we naturally would like to know how well this correction method can remove such biases in the more realistic case when the breaks are imperfectly estimated. That should still be studied much better, but it is interesting to discuss a number of other studies on the removal of network-wide trend biases from the perspective of this new understanding.

So this post will argue that it theoretically makes sense that (unavoidable) inaccuracies of break detection lead to network-wide trend biases only being partially corrected by statistical homogenization.

1) We have seen this in our study of the correction method in response to small errors in the break positions (Lindau and Venema, 2018).

2) The benchmarking study of NOAA’s homogenization algorithm shows that if the breaks are big and easy they are largely removed, while in the scenario where breaks are plentiful and small half of the trend bias remains (Williams et al., 2012).

3) Another benchmarking study shows that with the network density of Switzerland homogenization can find and remove clear trend biases, while if you thin this network to be similar to Peru the bias cannot be removed (Gubler et al., 2017).

4) Finally, a benchmarking study of relative humidity station observations in Austria could not remove much of the trend bias, which is likely because relative humidity is not correlated well from station to station (Chimani et al., 2018).

Statistical homogenization on a global scale makes warming estimates larger (Lawrimore et al., 2011; Menne et al., 2018). Thus if it can only remove part of any trend bias, this would mean that quite likely the actual warming was larger.

Figure 1: The inserted versus remaining network-mean trend error. Upper panel for perfect breaks. Lower panel for a small perturbation of the break position. The time series are 100 annual values and have 5 breaks. Figure 10 in Lindau and Venema (2018).

Joint correction method

First, what did our study on the correction method (Lindau and Venema, 2018) say about the importance of errors in the break position? As the paper was mostly about perfect breaks, we assumed that all breaks were known, but that they had a small error in their position. In the example in Figure 1, we perturbed the break position by a normally distributed random number with standard deviation one (lower panel); for comparison, the upper panel shows the case with perfect breaks.

In both cases we inserted a large network-wide trend bias of 0.873 °C over the length of the century-long time series. The inserted errors for 1000 simulations are on the x-axis; the average inserted trend bias is denoted by x̅. The remaining error after homogenization is on the y-axis. Its average is denoted by y̅ and is basically zero when the breaks are perfect (top panel). For the small perturbation (lower panel) the average remaining error is 0.093 °C, which is 11 % of the inserted trend bias. That is the under-correction for quite a small perturbation: 38 % of the positions are not changed at all.

If the standard deviation of the position perturbation is increased to 2, the remaining trend bias is 21 % of the inserted bias.
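The two percentages quoted for the perturbation with standard deviation one can be checked with a few lines of Python. The first is simply the ratio of the remaining to the inserted trend bias. For the second I assume that the normally distributed perturbation is rounded to whole years, so that a break stays in place when the random shift is smaller than half a year in absolute value; that rounding rule is my assumption, chosen because it reproduces the 38 %.

```python
import math

inserted_bias = 0.873   # °C over the century, from the text
remaining_bias = 0.093  # °C over the century, from the text

# Share of the inserted trend bias that remains after homogenization.
print(round(100 * remaining_bias / inserted_bias))  # 11

def prob_unchanged(sd):
    """Probability that a N(0, sd) shift rounds to zero, i.e. |x| < 0.5."""
    return math.erf(0.5 / (sd * math.sqrt(2)))

# Share of break positions that is not moved at all for sd = 1.
print(round(100 * prob_unchanged(1.0)))  # 38
```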

In the upper panel, there is basically no correlation between the inserted and the remaining error. That is, the remaining error does not depend on the break signal, but only on the noise. In the lower panel with the position errors, there is a correlation between the inserted and remaining trend error. So in this more realistic case, it does matter how large the trend bias due to the inhomogeneities is.

This is naturally an idealized case, position errors will be more complicated in reality and there would be spurious and missing breaks. But this idealized case fitted best to the aim of the paper of studying the correction algorithm in isolation.

It helps to understand where the problem lies. The correction algorithm is basically a regression that aims to explain the inserted break signal (and the regional climate signal). Errors in the predictors will lead to an explained variance that is less than 100 %. One should thus expect that the estimated break signal is smaller than the actual break signal. It is thus expected that the trend change produced by the estimated break signal is smaller than the actual trend change due to the inhomogeneities.
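A toy illustration of this regression argument, not the joint correction method itself: a noise-free step function is fitted once with the true break positions and once with slightly shifted ones. The positions, levels and function names are made up for the example.

```python
def fit_steps(series, break_positions):
    """Least-squares step function for the given breaks: the mean of each segment."""
    edges = [0] + sorted(break_positions) + [len(series)]
    fitted = []
    for start, end in zip(edges[:-1], edges[1:]):
        segment = series[start:end]
        fitted.extend([sum(segment) / len(segment)] * len(segment))
    return fitted

def explained_fraction(series, fitted):
    """Fraction of the variance of `series` explained by `fitted`."""
    mean = sum(series) / len(series)
    total = sum((x - mean) ** 2 for x in series)
    residual = sum((x - f) ** 2 for x, f in zip(series, fitted))
    return 1.0 - residual / total

# Hypothetical break signal: 100 years, five breaks, no noise.
true_breaks = [15, 35, 50, 70, 85]
levels = [0.0, 1.2, -0.4, 0.8, -1.0, 0.3]
edges = [0] + true_breaks + [100]
break_signal = []
for level, start, end in zip(levels, edges[:-1], edges[1:]):
    break_signal.extend([level] * (end - start))

# The same breaks with position errors of up to two years.
shifted_breaks = [16, 34, 50, 72, 84]

explained_perfect = explained_fraction(break_signal, fit_steps(break_signal, true_breaks))
explained_shifted = explained_fraction(break_signal, fit_steps(break_signal, shifted_breaks))
print(explained_perfect, explained_shifted)
```

With the true positions the step function explains all of the variance; with the shifted positions it explains less, so the estimated break signal is too small.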

NOAA’s benchmark

That statistical homogenization under-corrects when the going gets tough is also found by the benchmarking study of NOAA’s Pairwise Homogenization Algorithm in Williams et al. (2012). They simulated temperature networks like the American USHCN network and added inhomogeneities according to a range of scenarios. (Also with various climate change signals.) Some scenarios were relatively easy, had few and large breaks, while others were hard and contained many small breaks. The easy cases were corrected nearly perfectly with respect to the network-wide trend, while in the hard cases only half of the inserted network-wide trend error was removed.

The results of this benchmarking for the three scenarios with a network-wide trend bias are shown below. The three panels are for the three scenarios. Each panel has results (the crosses, ignore the box plots) for three periods over which the trend error was computed. The main message is that the homogenized data (orange crosses) lies between the inhomogeneous data (red crosses) and the homogeneous data (green crosses). Put differently, green is how much the climate actually changed, red is how much the estimate is wrong due to inhomogeneities, orange shows that homogenization moves the estimate towards the truth, but never fully gets there.

If we use the number of breaks and their average size as a proxy for the difficulty of the scenario, the one on the left has 6.4 breaks with an average size of 0.8 °C, the one in the middle 8.4 breaks (size 0.4 °C) and the one on the right 10 breaks (size 0.4 °C). So this suggests there is a clear dose-effect relationship, although there surely is more to it than just the number of breaks.

Figures from Williams et al. (2012) showing the results for three scenarios. This is a figure I created from parts of Figure 7 (left), Figure 5 (middle) and Figure 10 (right; their numbers).

When this study appeared in 2012, I found the scenario with the many small breaks much too pessimistic. However, our recent study estimating the properties of the inhomogeneities of the American network found a surprisingly large number of breaks, more than 17 per century, and they were bigger: 0.5 °C. So purely based on the number of breaks the hardest scenario is even optimistic, but size matters as well.

Not that I would already like to claim that even in a dense network like the American there is a large remaining trend bias and the actual warming was much larger. There is more to the difficulty of inhomogeneities than their number and size. It sure is worth studying.

Alpine benchmarks

The other two examples in the literature I know of are examples of under-correction in the sense of basically no correction because the problem is simply too hard. Gubler et al. (2017) shows that the raw data of the Swiss temperature network has a clear trend bias, which can be corrected with homogenization of its dense network (together with metadata), but when they thin the network to a network density similar to that of Peru, they are unable to correct this trend bias. For more details see my review of this article in the Grassroots Review Journal on Homogenization.

Finally, Chimani et al. (2018) study the homogenization of daily relative humidity observations in Austria. I made a beautiful daily benchmark dataset, it was a lot of fun: on a daily scale you have autocorrelations and a distribution with an upper and lower limit, which need to be respected by the homogeneous data and the inhomogeneous data. But already the normal homogenization of monthly averages was much too hard.

Austria has quite a dense network, but relative humidity is much influenced by very local circumstances and does not correlate well from station to station. My co-authors of the Austrian weather service wanted to write about the improvements: "an improvement of the data by homogenization was non‐ideal for all methods used". For me the interesting finding was: nearly no improvement was possible. That was unexpected. Had we expected that, we could have generated a much simpler monthly or annual benchmark to show no real improvement was possible for humidity data and saved ourselves a lot of (fun) work.

What does this mean for global warming estimates?

When statistical homogenization only partially removes large-scale trend biases, what does this mean for global warming estimates? In the global temperature datasets statistical homogenization leads to larger warming estimates. So if we tend to underestimate how much correction is needed, this would mean that the Earth most likely warmed up more than current estimates indicate. How much exactly is hard to tell at the moment and thus needs a nuanced discussion. Let me give you my considerations in the next post.

Other posts in this series

Part 5: Statistical homogenization under-corrects any station network-wide trend biases

Part 4: Break detection is deceptive when the noise is larger than the break signal

Part 3: Correcting inhomogeneities when all breaks are perfectly known

Part 2: Trend errors in raw temperature station data due to inhomogeneities

Part 1: Estimating the statistical properties of inhomogeneities without homogenization


Chimani, Barbara, Victor Venema, Annemarie Lexer, Konrad Andre, Ingeborg Auer and Johanna Nemec, 2018: Inter-comparison of methods to homogenize daily relative humidity. International Journal of Climatology, 38, pp. 3106–3122.

Gubler, Stefanie, Stefan Hunziker, Michael Begert, Mischa Croci-Maspoli, Thomas Konzelmann, Stefan Brönnimann, Cornelia Schwierz, Clara Oria and Gabriela Rosas, 2017: The influence of station density on climate data homogenization. International Journal of Climatology, 37, pp. 4670–4683.

Lawrimore, Jay H., Matthew J. Menne, Byron E. Gleason, Claude N. Williams, David B. Wuertz, Russell S. Vose and Jared Rennie, 2011: An overview of the Global Historical Climatology Network monthly mean temperature data set, version 3. Journal of Geophysical Research, 116, D19121.

Lindau, Ralf and Victor Venema, 2018: On the reduction of trend errors by the ANOVA joint correction scheme used in homogenization of climate station records. International Journal of Climatology, 38, pp. 5255–5271.

Menne, Matthew J., Claude N. Williams, Byron E. Gleason, Jared J. Rennie and Jay H. Lawrimore, 2018: The Global Historical Climatology Network Monthly Temperature Dataset, Version 4. Journal of Climate, 31, 9835–9854.

Williams, Claude, Matthew Menne and Peter Thorne, 2012: Benchmarking the performance of pairwise homogenization of surface temperatures in the United States. Journal of Geophysical Research, 117, D05116.

Monday, 27 April 2020

Break detection is deceptive when the noise is larger than the break signal

I am disappointed in science. It is impossible that it took this long for us to discover that break detection has serious problems when the signal to noise ratio is low. However, as far as we can judge this was new science and it certainly was not common knowledge, which it should have been because it has large consequences.

This post describes a paper by Ralf Lindau and me about how break detection depends on the signal to noise ratio (Lindau and Venema, 2018). The signal in this case consists of the breaks we would like to detect. These breaks could be from a change in instrument or location of the station. We detect breaks by comparing a candidate station to a reference. This reference can be one other neighbouring station or an average of neighbouring stations. The candidate and reference should be sufficiently close so that they have the same regional climate signal, which is then removed by subtracting the reference from the candidate. The difference time series that is left contains breaks and noise because of measurement uncertainties and differences in local weather. The noise thus depends on the quality of the measurements, on the density of the measurement network and on how variable the weather is spatially.

The signal to noise ratio (SNR) is simply defined as the standard deviation of the time series containing only the breaks divided by the standard deviation of the time series containing only the noise. For short I will denote these as the break signal and the noise signal, which have a break variance and a noise variance. When generating data to test homogenization algorithms, you know exactly how strong the break signal and the noise signal are. In case of real data, you can estimate it, for example with the methods I described in a previous blog post. In that study, we found a signal to noise ratio for annual temperature averages observed in Germany of 3 to 4 and in America of about 5.
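For simulated data, where the break signal and the noise signal are known separately, this definition is a one-liner. The two toy series below are invented for the example.

```python
import statistics

def signal_to_noise_ratio(break_signal, noise_signal):
    """Standard deviation of the break signal divided by that of the noise signal."""
    return statistics.pstdev(break_signal) / statistics.pstdev(noise_signal)

# Toy example: one break in the middle of a 100-year series,
# and a noise series with twice the standard deviation.
break_signal = [1.0] * 50 + [-1.0] * 50
noise_signal = [2.0, -2.0] * 50

snr = signal_to_noise_ratio(break_signal, noise_signal)
print(snr)
```

Here the break signal has a standard deviation of one and the noise of two, so the SNR is one half.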

Temperature is studied a lot and much of the work on homogenization takes place in Europe and America. Here this signal to noise ratio is high enough. That may be one reason why climatologists did not find this problem sooner. Many other sciences use similar methods, and we are all supported by a considerable statistical literature. I have no idea what their excuses are.

Why a low SNR is a problem

As scientific papers go, the discussion is quite mathematical, but the basic problem is relatively easy to explain in words. In statistical homogenization we do not know in advance where the break or breaks will be. So we basically try many break positions and search for the break positions that result in the largest breaks (or, for the algorithm we studied, that explain the most variance).

If you do this for a time series that contains only noise, this will also produce (small) breaks. For example, in case you are looking for one break, due to pure chance there will be a difference between the averages of the first and the last segment. This difference is larger than it would be for a predetermined break position, as we try all possible break positions and then select the one with the largest difference. To determine whether the breaks we found are real, we require that they are so large that it is unlikely that they are due to chance, while there are actually no breaks in the series. So we study how large breaks are in series that only contain noise to determine how large such random breaks are. Statisticians would talk about the breaks being statistically significant with white noise as the null hypothesis.
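A small simulation sketch of this effect for a single break in pure white noise. It uses the variance explained by one break as the criterion; the series length, number of series and seed are arbitrary choices for the example.

```python
import random

def explained_by_one_break(series, position):
    """Sum of squares explained by a single break at `position` (two segment means)."""
    before, after = series[:position], series[position:]
    n1, n2 = len(before), len(after)
    difference = sum(after) / n2 - sum(before) / n1
    return n1 * n2 / (n1 + n2) * difference ** 2

rng = random.Random(42)
n_years, n_series = 100, 200

fixed, searched = [], []
for _ in range(n_series):
    noise = [rng.gauss(0.0, 1.0) for _ in range(n_years)]
    # A break position chosen in advance: the middle of the series.
    fixed.append(explained_by_one_break(noise, n_years // 2))
    # The best of all possible break positions.
    searched.append(max(explained_by_one_break(noise, p) for p in range(1, n_years)))

mean_fixed = sum(fixed) / n_series
mean_searched = sum(searched) / n_series
print(f"fixed position: {mean_fixed:.2f}, best position: {mean_searched:.2f}")
```

The best position always explains at least as much as the predetermined one, which is why the significance threshold has to be derived from searched noise series and not from a single fixed position.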

When the breaks are really large compared to the noise one can see by eye where the positions of the breaks are and this method is nice to make this computation automatically for many stations. When the breaks are “just” large, it is a great method to objectively determine the number of breaks and the optimal break positions.

The problem comes when the noise is larger than the break signal. Not that it is fundamentally impossible to detect such breaks. If you have a 100-year time series with a break in the middle, you would be averaging over 50 noise values on either side and the difference in their averages would be much smaller than the noise itself. Even if noise and signal are about the same size the noise effect is thus expected to be smaller than the size of such a break. To put it in another way, the noise is not correlated in time, while the break signal is the same for many years; that fundamental difference is what the break detection exploits.
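The averaging argument in numbers, assuming independent noise values with a standard deviation of one (the variable names are illustrative):

```python
import math

noise_sd = 1.0        # standard deviation of the yearly noise
n_per_segment = 50    # years on either side of the break

# Standard deviation of the difference between two independent segment means.
sd_of_difference = noise_sd * math.sqrt(1 / n_per_segment + 1 / n_per_segment)
print(sd_of_difference)
```

So for a break in the middle of a 100-year series the noise contributes about a fifth of its own standard deviation to the estimated break size.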

However, to come to the fundamental problem, it becomes hard to determine the positions of the breaks. Imagine the theoretical case where the break positions are fully determined by the noise, not by the breaks. From the perspective of the break signal, these break positions are random. The problem is, also random breaks explain a part of the break signal. So one would have a combination with a maximum contribution of the noise plus a part of the break signal. Because of this additional contribution by the break signal, this combination may have larger breaks than expected in a pure noise signal. In other words, the result can be statistically significant, while we have no idea where the positions of the breaks are.

In a real case the breaks look even more statistically significant because the positions of the breaks are determined by both the noise and the break signal.

That is the fundamental problem: the test for the homogeneity of the series rightly detects that the series contains inhomogeneities, but if the signal to noise ratio is low we should not jump to conclusions and expect that the set of break positions that gives us the largest breaks has much to do with the break positions in the data. Only if the signal to noise ratio is high is this relationship close enough.

Some numbers

This is a general problem, which I expect all statistical homogenization algorithms to have, but to put some numbers on this, we need to specify an algorithm. We have chosen to study the multiple breakpoint method that is implemented in PRODIGE (Caussinus and Mestre, 2004), HOMER (Mestre et al., 2013) and ACMANT (Domonkos and Coll, 2017). These are among the best, if not the best, methods we currently have. We applied it by comparing pairs of stations, like PRODIGE and HOMER do.

For a certain number of breaks this method effectively computes the combination of breaks that has the highest break variance. If you add more breaks, you will increase the break variance those breaks explain, even if it were purely due to noise, so there is additionally a penalty function that depends on the number of breaks. The algorithm selects that option where the break variance minus such a penalty is highest. A statistician would call this a model selection problem and the job of the penalty is to keep the statistical model (the step function explaining the breaks) reasonably simple.
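A minimal sketch of such a penalized search, using dynamic programming to find, for each number of breaks, the step function with the smallest residual (which is the same as the largest explained break variance). The constant penalty per break is a hypothetical stand-in and is not the penalty function used in PRODIGE, HOMER or ACMANT; the function names are illustrative.

```python
def segment_sse(prefix, prefix_sq, start, end):
    """Sum of squared deviations from the mean for series[start:end]."""
    total = prefix[end] - prefix[start]
    total_sq = prefix_sq[end] - prefix_sq[start]
    return total_sq - total * total / (end - start)

def detect_breaks(series, max_breaks, penalty_per_break):
    """Break positions minimizing residual sum of squares plus a penalty per break."""
    n = len(series)
    prefix, prefix_sq = [0.0], [0.0]
    for x in series:
        prefix.append(prefix[-1] + x)
        prefix_sq.append(prefix_sq[-1] + x * x)

    inf = float("inf")
    # cost[k][j]: smallest residual when series[:j] contains k breaks.
    cost = [[inf] * (n + 1) for _ in range(max_breaks + 1)]
    last = [[0] * (n + 1) for _ in range(max_breaks + 1)]
    for j in range(1, n + 1):
        cost[0][j] = segment_sse(prefix, prefix_sq, 0, j)
    for k in range(1, max_breaks + 1):
        for j in range(k + 1, n + 1):
            for i in range(k, j):  # i is the position of the last break
                candidate = cost[k - 1][i] + segment_sse(prefix, prefix_sq, i, j)
                if candidate < cost[k][j]:
                    cost[k][j] = candidate
                    last[k][j] = i

    # Model selection: more breaks always fit better, so each break has a price.
    best_k = min(range(max_breaks + 1),
                 key=lambda k: cost[k][n] + penalty_per_break * k)

    # Trace back the positions of the selected breaks.
    positions, j = [], n
    for k in range(best_k, 0, -1):
        j = last[k][j]
        positions.append(j)
    return sorted(positions)

example = [0.0] * 10 + [5.0] * 10
print(detect_breaks(example, max_breaks=3, penalty_per_break=1.0))
```

In the toy series one break explains everything, so additional breaks only cost penalty and are not selected; with a very large penalty no break is set at all.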

In the end, if the signal to noise ratio is one half, the breaks that explain the most variance are just as “good” at explaining the actual break signal in the data as breaks at random positions.

With this detection model, we derived the plot below, let me talk you through this. On the x-axis is the SNR, on the right the break signal is twice as strong as the noise signal. On the y-axis is how well the step function belonging to the detected breaks fits to the step function of the breaks we actually inserted. The lower curve, with the plus symbols, is the detection algorithm as I described above. You can see that for a high SNR it finds a solution that closely matches what we put in and the difference is almost zero. The upper curve, with the ellipse symbols, is for the solution you find if you put in random breaks. You can see that for a high SNR the random breaks have a difference of 0.5. As the variance of the break signal is one, this means that half the variance of the break signal is explained by random breaks.

Figure 13b from Lindau and Venema (2018).

When the SNR is about 0.5, the random breaks are about as good as the breaks proposed by the algorithm described above.

One may be tempted to think that if the data is too noisy, the detection algorithm should detect fewer breaks, that is, the penalty function should be bigger. However, the problem is not the detection of whether there are breaks in the data, but where the breaks are. A larger penalty thus does not solve the problem and even makes the results slightly worse. This is not in the paper, but later I wondered whether setting more breaks is such a bad thing, so we also tried lowering the threshold. This again made the results worse.

So what?

The next question is naturally: is this bad? One reason to investigate correction methods in more detail, as described in my last blog post, was the hope that maybe accurate break positions are not that important. It could have been that the correction method still produces good results even with random break positions. This is unfortunately not the case: already quite small errors in break positions deteriorate the outcome considerably. This will be the topic of the next post.

Not homogenizing the data is also not a solution. As I described in a previous blog post, the breaks in Germany are small and infrequent, but they still have a considerable influence on the trends of stations. The figure below shows the trend differences between many pairs of nearby stations in Germany. Their differences in trends will be mostly due to inhomogeneities. The standard deviation of 0.628 °C per century for the pairs translates to an average error in the trends of individual stations of 0.4 °C per century.

The trend differences (y-axis) of pairs of stations (x-axis) in the German temperature network. The trends were computed from 316 nearby pairs over 1950 to 2000. Figure 2 from Lindau and Venema (2018).

This finding makes it more important to work on methods to estimate the signal to noise ratio of a dataset before we try to homogenize it. This is easier said than done. The method introduced in Lindau and Venema (2018) gives results for every pair of stations, but needs some human checks to ensure the fits are good. Furthermore, it assumes the break levels behave like noise, while in Lindau and Venema (2019) we found that the break signal in the USA behaves like a random walk. This 2019 method needs a lot of data. Even the results for Germany are already quite noisy, and if you apply it to data sparse regions you have to select entire continents. Doing so, however, biases the results to those subregions where there are many stations and would thus give too high SNR estimates. So computing SNR worldwide is not just a blog post, but requires a careful study and likely the development of a new method to estimate the break and noise variance.

Both methods compute the SNR for one difference time series, but in a real case multiple difference time series are used. We will need to study how to do this in an elegant way. How many difference series are used depends on the homogenization method, so this would also make the SNR method-dependent. I would appreciate also having an estimation method that is more universal and can be used to compare networks with each other.

This estimation method should then be applied to global datasets and for various periods to study which regions and periods have a problem. Temperature (as well as pressure) are variables that are well correlated from station to station. Much more problematic variables, which should thus be studied as well, are precipitation, wind and humidity. In case of precipitation, there tend to be more stations. This will compensate some, but for the other variables there may even be fewer stations.

We have some ideas how to overcome this problem, from ways to increase the SNR to completely different ways to estimate the influence of inhomogeneities on the data. But they are too preliminary to already blog about. Do subscribe to the blog with any of the options below the tag cloud near the end of the page. ;-)

When we digitize climate data that is currently only available on paper, we tend to prioritize data from regions and periods where we do not have much information yet. However, if after that digitization the SNR would still be low, it may be more worthwhile to digitize data from regions/periods where we already have more data and get that region/period to an SNR above one.

The next post will be about how this low SNR problem changes our estimates of how much the Earth has been warming. Spoiler: the climate “sceptics” will not like that post.


Caussinus, Henri and Olivier Mestre, 2004: Detection and correction of artificial shifts in climate series. The Journal of the Royal Statistical Society, Series C (Applied Statistics), 53, pp. 405-425.

Domonkos, Peter and John Coll, 2017: Homogenisation of temperature and precipitation time series with ACMANT3: method description and efficiency tests. International Journal of Climatology, 37, pp. 1910-1921.

Lindau, Ralf and Victor Venema, 2018: The joint influence of break and noise variance on the break detection capability in time series homogenization. Advances in Statistical Climatology, Meteorology and Oceanography, 4, pp. 1–18.

Lindau, Ralf and Victor Venema, 2019: A new method to study inhomogeneities in climate records: Brownian motion or random deviations? International Journal of Climatology, 39, pp. 4769–4783.

Mestre, Olivier, Peter Domonkos, Franck Picard, Ingeborg Auer, Stephane Robin, Émilie Lebarbier, Reinhard Boehm, Enric Aguilar, Jose Guijarro, Gregor Vertachnik, Matija Klancar, Brigitte Dubuisson, Petr Stepanek, 2013: HOMER: a homogenization software - methods and applications. IDOJARAS, Quarterly Journal of the Hungarian Meteorological Society, 117, no. 1, pp. 47–67.

Thursday, 23 April 2020

Correcting inhomogeneities when all breaks are perfectly known

Much of the scientific literature on the statistical homogenization of climate data is about the detection of breaks, especially the literature before 2012. Much of the more recent literature studies complete homogenization algorithms. That leaves a gap for the study of correction methods.

Spoiler: if we know all the breaks perfectly, the correction method removes trend biases from a climate network perfectly. The outcome I found most surprising is that in this case the size of the breaks is irrelevant for how well the correction method works. What matters is the noise.

This post is about a study filling this gap by Ralf Lindau and me. The post assumes you are familiar with statistical homogenization, if not you can find more information here. For correction you naturally need information on the breaks. To study correction in isolation as much as possible, we have assumed that all breaks are known. That is naturally quite theoretical, but it makes it possible to study the correction method in detail.

The correction method we have studied is a so-called joint correction method, that means that the corrections for all stations in a network are computed in one go. The somewhat unfortunate name ANOVA is typically used for this correction method. The equations are the same as those of the ANOVA test, but the application is quite different, so I find this name confusing.

This correction method makes three assumptions. 1) That all stations have the same regional climate signal. 2) That every station has its own break signal, which is a step function with the positions of the steps given by the known breaks. 3) That every station also has its own measurement and weather noise. The algorithm computes the values of the regional climate signal and the levels of the step functions by minimizing this noise. So in principle the method is a simple least squares regression, but with many more coefficients than when you use it to compute a linear trend.
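The three assumptions can be turned into a small sketch. To avoid a linear algebra library it does not solve the regression directly but alternates between the two sets of coefficients (alternating least squares), which is a technique swapped in for this example and not necessarily how homogenization packages compute it. Function names and the toy network are invented.

```python
def joint_correction(data, breaks, n_iterations=200):
    """Fit data[s][t] = climate[t] + steps[s][t] + noise by alternating least squares.

    data[s][t] is the observation of station s in year t,
    breaks[s] the sorted, known break positions of station s.
    """
    n_stations, n_years = len(data), len(data[0])
    climate = [0.0] * n_years
    steps = [[0.0] * n_years for _ in range(n_stations)]
    for _ in range(n_iterations):
        # Given the climate signal, the best level of each segment is the
        # segment mean of data minus climate.
        for s in range(n_stations):
            edges = [0] + list(breaks[s]) + [n_years]
            for start, end in zip(edges[:-1], edges[1:]):
                level = sum(data[s][t] - climate[t]
                            for t in range(start, end)) / (end - start)
                for t in range(start, end):
                    steps[s][t] = level
        # Given the step functions, the best climate value of each year is
        # the station mean of data minus steps.
        for t in range(n_years):
            climate[t] = sum(data[s][t] - steps[s][t]
                             for s in range(n_stations)) / n_stations
    return climate, steps

def residual_sum_of_squares(data, climate, steps):
    """The noise term that the method minimizes."""
    return sum((data[s][t] - climate[t] - steps[s][t]) ** 2
               for s in range(len(data)) for t in range(len(climate)))

# Toy network without noise: three stations, twelve years.
n_years = 12
true_climate = [0.1 * t for t in range(n_years)]
breaks = [[4], [8], [3, 9]]
true_levels = [[0.0, 1.0], [0.5, -0.5], [0.0, -1.0, 0.4]]
data = []
for station_breaks, station_levels in zip(breaks, true_levels):
    edges = [0] + station_breaks + [n_years]
    series = []
    for level, start, end in zip(station_levels, edges[:-1], edges[1:]):
        series.extend(true_climate[t] + level for t in range(start, end))
    data.append(series)

climate, steps = joint_correction(data, breaks)
rss_start = sum(x * x for series in data for x in series)
rss_end = residual_sum_of_squares(data, climate, steps)
print(rss_start, rss_end)
```

Subtracting steps[s] from data[s] then gives the corrected series of station s, up to a constant offset that this model cannot determine.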

Three steps

In this study we compute the errors after correction in three ways, one after another. To illustrate this let’s start simple and simulate 1000 networks of 10 stations with 100 years/values. In the first examples below these stations have exactly five breaks, whose sizes are drawn from a normal distribution with variance one. The noise, simulating measurement uncertainties and differences in local weather, also has a variance of one. This is quite noisy for European annual temperature averages, but such noise levels occur earlier in the climate record and in other regions. Also, to keep it simple, there is no net trend bias yet.
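A sketch of how one such network could be simulated. Two details are assumptions for the example: the break positions are drawn uniformly at random, and "size" is read as the level of each segment being an independent draw with variance one. The function name and seed are illustrative.

```python
import random

def simulate_station(rng, n_years=100, n_breaks=5, break_sd=1.0, noise_sd=1.0):
    """Return (break_positions, break_signal, noise) for one station."""
    positions = sorted(rng.sample(range(1, n_years), n_breaks))
    edges = [0] + positions + [n_years]
    break_signal = []
    for start, end in zip(edges[:-1], edges[1:]):
        level = rng.gauss(0.0, break_sd)  # assumed: independent segment levels
        break_signal.extend([level] * (end - start))
    noise = [rng.gauss(0.0, noise_sd) for _ in range(n_years)]
    return positions, break_signal, noise

rng = random.Random(2020)
network = [simulate_station(rng) for _ in range(10)]  # one network of 10 stations

positions, break_signal, noise = network[0]
print(positions)
```

Repeating this 1000 times, and adding a common regional climate signal to the break signal and noise of every station, would give the simulated observations to which the correction is applied.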

The figure to the right is a scatterplot with theoretically 1000*10*100=1 million yearly temperature averages as they were simulated (on the x-axis) and after correction (y-axis).

Within the plots we show some statistics, on the top left these are 1) the mean of x, i.e. the mean of the inserted inhomogeneities. 2) Then the variance of the inserted inhomogeneities x. 3) Then the mean of the computed corrections y. 4) Finally the variance of the corrections.

In the lower right, 1) the correlation (r) is shown and 2) the number of values (n). For technical reasons, we only show a sample of the 1 million points in the scatterplot, but these statistics are based on all values.

The results look encouraging, they show a high correlation: 0.96. And the points nicely scatter around the x=y line.

The second step is to look at the trends of the stations, there is one trend per station, so we have 1000*10=10,000 of them. See figure to the right. The trend is computed in the standard way using least squares linear regression. Trends would normally have the unit °C per year or century. Here we multiplied the trend with the period, so the values are the total change due to the trend and have unit °C.
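The trend computation as a short sketch. Whether the period is taken as the number of values or as the time between the first and last value is not stated above; the code assumes the number of values. The function name is illustrative.

```python
def total_change_from_trend(series):
    """Least-squares linear trend multiplied by the length of the series.

    Assumes one value per year and takes the period as the number of values.
    """
    n = len(series)
    t_mean = (n - 1) / 2
    y_mean = sum(series) / n
    covariance = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
    variance = sum((t - t_mean) ** 2 for t in range(n))
    slope = covariance / variance  # °C per year
    return slope * n               # °C over the full period

# A series warming by exactly 0.01 °C per year over 100 years.
example = [0.01 * t for t in range(100)]
print(total_change_from_trend(example))
```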

The values again scatter beautifully around x=y and the correlation is almost as high as before: 0.95.

The final step is to compute the 1000 network trends. The result is shown below. The averaging over 10 stations reduces the noise in the scatterplot and the values beautifully scatter around the x=y line. While the correlation is now smaller, it is still decent: 0.81. Remember we started with quite noisy data where the noise was as large as the break signal.

The remaining error

In the next step, rather than plotting the network trend after correction on the y-axis, we plotted the difference between this trend and the inserted network mean trend, which is the trend error remaining after correction. This is plotted in the left panel below. For this case the uncertainty after correction is half of the uncertainty before correction in terms of the printed variances. It is typical to express uncertainties as standard deviations, then the remaining trend error is 71%. Furthermore, their averages are basically zero, so no bias is introduced.
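The step from half the variance to 71 % in terms of standard deviations is just a square root:

```python
import math

variance_ratio = 0.5                  # remaining variance / variance before correction
sd_ratio = math.sqrt(variance_ratio)  # the same ratio for standard deviations
print(f"{sd_ratio:.0%}")
```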

With a signal having as much break variance as noise variance from the measurement and weather differences between the stations, the correction algorithm naturally cannot reconstruct the original inhomogeneities perfectly, but it does so decently and its errors have nice statistical properties.

Now if we increase the variance of the break signal by a factor two we get the result shown in the right panel. Comparing the two panels, it is striking that the trend error after correction is the same. It does not depend on the break signal; only the noise determines how accurate the trends are. In case of large break signals this is nice, but if the break signal is small, this will also mean that the correction can increase the random trend error. That could be the case in regions where the networks are sparse and the difference time series between two neighboring stations consequently quite noisy.

Large-scale trend biases

This was all quite theoretical, as the networks did not have a bias in their trends. They did have a random trend error due to the inserted inhomogeneities, but averaging over many such networks of 10 stations the trend error would tend to zero. If that were the case in reality, not many people would work on statistical homogenization. The main aim is to reduce the uncertainties in large-scale trends due to (possible) large-scale biases in the trends.

Such large-scale trend biases can be caused by changes in the thermometer screens used, the transition from manual to automatic observations, urbanization around the stations or relocations of stations to better sites.

If we add a trend bias to the inserted inhomogeneities and correct the data with the joint correction method, we find the result to the right. We inserted a large trend bias to all networks of 0.9 °C and after correction it was completely removed. This again does not depend on the size of the bias or the variance of the break signal.

However, all this is only true if all breaks are known. Before I write a post about the more realistic case where the breaks are not perfectly known, I will first have to write a post about how well we can detect breaks. That will be the next homogenization post.

Some equations

Next to these beautiful scatterplots, the article has equations for each of the above-mentioned three steps: 1) from the inserted breaks and noise to what this means for the station data, 2) how this affects the station trend errors, and 3) how this results in network trends.

With equations for the influence of the size of the break signal (the standard deviation of the breaks) and the noise of the difference time series (the standard deviation of the noise) one can then compute how the trend errors before and after correction depend on the signal to noise ratio (SNR), which is the standard deviation of the breaks divided by the standard deviation of the noise. There is also a clear dependence on the number of breaks.

Whether the network trend errors increase or decrease due to the correction method is determined by a quite simple expression: 6 times the SNR divided by the number of breaks. So if the SNR is one, as in the initial example of this post, and the number of breaks is 6 or smaller, the correction would improve the trend error, while if there are 7 or more breaks the correction would add a random trend error. This simple expression ignores a weak dependence of the results on the number of stations in the networks.
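Written out as a rule of thumb, reading the example above as "the correction helps when the ratio is at least one". That threshold and the function name are an interpretation for this sketch, and the weak dependence on the number of stations is ignored here as well.

```python
def correction_improves_trend(snr, n_breaks):
    """Rule of thumb: 6 * SNR / number of breaks, compared to one (assumed threshold)."""
    return 6 * snr / n_breaks >= 1

for n_breaks in (5, 6, 7):
    print(n_breaks, correction_improves_trend(1.0, n_breaks))
```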

Further research

I started by saying that correction methods were a research gap, but homogenization algorithms have many more steps beyond detection and correction, which should also be studied in isolation if possible to gain a better understanding.

For example, the computation of a composite reference. The selection of reference stations. The combination of statistical homogenization with metadata on documented changes in the measurement setup. And so on. The last chapter of the draft guidance on homogenization describes research needs, including research on homogenization methods. There are still a lot of interesting and important questions.


Lindau, Ralf and Victor Venema, 2018: On the reduction of trend errors by the ANOVA joint correction scheme used in homogenization of climate station records. International Journal of Climatology, 38, pp. 5255–5271.

Monday, 20 April 2020

Corona Virus Update: the German situation improved, but if we relax measures so much that the virus comes back it will not be just local outbreaks (part 32)

The state of the epidemic has improved in Germany and we now have about half the number of people getting ill compared to the peak we had a month ago. This has resulted in calls to relax the social distancing measures. The German states have decided that mostly nothing will happen the next two weeks, but small shops will open again (and some other shops) and in some states some school classes may open again (although I am curious whether that will actually happen, German Twitter is not amused).

An estimate of the number of new cases by the date these people got ill. Similar graphs tend to show the new cases by the date they became known to the health departments. By looking at the date people became ill, which is often much earlier, you can see a faster response to changes in social distancing. In dark blue you see the cases where the date someone got ill is known. In grey are the cases where it was estimated because only the case is known, but not when someone became ill. In light blue is an estimate for how many cases will still come in.

So in the last episode of the Corona Virus Update science journalist Korinna Hennig tried to get the opinion of Christian Drosten on these political measures. He does not like giving political advice, but he did venture that some politicians seem to wrongly think measures can be relaxed without the virus coming back. The two weeks that the lockdown continues should be used to prepare other measures that can replace the lockdown-type measures, such as a track-and-trace CoronaApp and the public wearing everyday masks.

Another reason it may be possible to relax measures somewhat would be that the virus may spread less efficiently in summer. It is not expected to go away, but the number of people who are infected by one infected person may go down a bit.

When the virus comes back, either because we relaxed social distancing too much too early or because of the winter, it will look different from this first wave. This first wave was characterized by local outbreaks. A second wave would be everywhere as the virus (and its various mutations) is spreading evenly geographically.

Korinna Hennig asks Drosten to explain why it is easier for him to call COVID-19 a pandemic than for the World Health Organization. This question was inspired by Trump complaining that the WHO declared the pandemic too late. Drosten notes that it has political consequences when the WHO calls the situation a pandemic, but that does not influence the situation in your country and what Trump could have done.

Really interesting was the part at the end on some possible (not guaranteed) positive surprises.

Prof. Dr. Christian Drosten, expert for emerging viruses and developer of the WHO SARS-CoV-2 virus test, which was used in 150 countries.

The situation and measures in Germany

Korinna Hennig:
What's your assessment, how long would [the reproductive rate] have to stay below one for it to have a really long-term effect, so that we're not going to say at some point that we have to close all the schools again?
Christian Drosten:
I believe there is talk of months [in a report by the Helmholtz Association]. I can well believe that this is the case. However, this is not the path that has been chosen in essence [by the German government], but rather - I believe - the idea has arisen that the intention is to keep it within the current range, perhaps by taking additional measures to reduce the pressure a little more.

That is an important point of view, and one that needs to be understood. It is not primarily a question of saying that we have now achieved a great deal, that the measures have already had a considerable impact. And now we are simply letting them go a tad, because we no longer want to. Then at some point we will have to take a look and then we will have to consider how to proceed, that is one view.

The other is that everything will work out fine. Sometimes you can hear that between the lines. I have the feeling, particularly among the general public, that many people, even in politics, are speculating that it will not come back at all, that it will not pick up any momentum. Unfortunately, that is not what the epidemiological modelers are saying, but it is generally assumed that, if nothing is offered as a counter-offer to this relaxation of measures, it will really get out of hand.

And the idea is, of course - and this is a very real idea in Germany - that people say that they are now relaxing these measures to a small extent, but to a really small extent. It is rather the case that corrections are being made in places where we think we can perhaps get away with it without the efficient reduction of transmission suffering in the first place. And now, in the time that has been gained by the decision, preparations are being made to allow other measures to come into force. And this of course includes the great promise of automated case tracking.

The cell phone tracking ... doesn't have to do the job completely, but you can combine it. You could say that there is a human manual case tracking system, but it gets help from such electronic measures, while you introduce these electronic measures. After all, this is not something that is introduced overnight; there must be some transition. I believe that the few weeks of time that have now been gained once again can be used to introduce such measures, and that is where a great deal of faith comes from at the moment.

Of course, there are other things to hope for as additional effects, such as, for example, a recommendation on the wearing of masks by the public. That could have an additional effect. Of course there will also be a small additional effect on seasonality. We have already discussed this, and there are studies which say that, unfortunately, there is probably not a large effect on seasonality, but there is a small effect on seasonality.

That is where things are coming together, so that we hope that the speed of propagation will perhaps slow down again overall and that we will at least be able to enter a region over the summer and into the autumn, where we will unfortunately see the effect of winter coming again, a possible winter wave, but where we will then have the first pharmaceutical interventions. Perhaps a first drug, with which certain risk patients could be treated in an early stage. Maybe first use studies, so efficacy studies of first vaccines. This is the overall concept, which one hopes will work.
Currently one infected person infects 0.7 or 0.8 other persons (R, the reproduction number). That is behind the decline in the number of new cases. Theoretically you could thus allow for 25% more contacts while still being in a stable situation. I would be surprised if the small relaxations decided for the next two weeks would do that. I do worry that these relaxations make people take the problem less seriously and that can quickly lead to 25% more contacts.
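
The 25% figure follows from simple arithmetic if one assumes that the reproduction number scales in proportion to the number of contacts. That proportionality is a simplification made here for illustration, it is not stated in the podcast, and the function name below is made up. A minimal sketch:

```python
def allowed_contact_increase(r_current: float) -> float:
    """Fractional increase in contacts that brings R up to exactly 1.

    Assumes R is proportional to the number of contacts, which is a
    simplification used here only for illustration.
    """
    if r_current <= 0:
        raise ValueError("r_current must be positive")
    return 1.0 / r_current - 1.0


# The two values of R mentioned in the text.
for r in (0.7, 0.8):
    increase = allowed_contact_increase(r)
    print(f"R = {r}: up to {increase:.0%} more contacts before R reaches 1")
```

With R = 0.8 this gives the 25% mentioned above; with R = 0.7 there would be room for roughly 43% more contacts under the same assumption.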

I would personally prefer this decline to continue until we get to a level where containment by manually tracking infected people and their contacts becomes an effective way to fight the epidemic; Mailab explains it well in German.

If we get the tracking of infected people with a CoronaApp working, it would matter much less at which level of contagion we start, but I do not expect that the CoronaApp will be able to do all the work, it will likely need to be complemented by manual tracking. With the current plans, according to rumours in the media, placing less emphasis on privacy of the users, I worry that too few will participate to make any kind of dent. An app where we can only hope and need to trust that the government keeps its side of the bargain and does not abuse the data would also be less useful in large parts of the world where you can definitely not trust the government.

That some states are already starting with opening up some classes is in principle a good thing. But it goes too fast, the schools are not prepared yet and I see quite some backlash coming. If done well, by opening a few school classes we could have learned how to do this before we do more and we could study how much this contributes to a higher reproduction number R. If we are lucky, maybe hardly at all; see the last section on possible positive surprises.


The flu normally goes away in summer. This is not expected for SARS-2, but the reproduction number could be 0.5 lower, that is, one infected person would infect half a person fewer. Without measures it is expected to be between 2 and 3 and we have to keep this reproduction number below 1 to keep the situation from getting out of hand again. The summer may thus help a bit, which could mean less stringent restrictions.
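
To get a feeling for how much a 0.5 lower reproduction number would help, here is a rough sketch. It assumes that measures reduce transmission by a fixed fraction and that the summer effect simply subtracts 0.5 from the reproduction number without measures. Both are simplifications for illustration and the function name is made up.

```python
def required_reduction(r_without_measures: float) -> float:
    """Fraction by which transmission must drop to bring R down to 1."""
    if r_without_measures <= 1:
        return 0.0  # already at or below 1, nothing needs to be reduced
    return 1.0 - 1.0 / r_without_measures


SUMMER_DROP = 0.5  # the possible seasonal reduction mentioned in the text

# The range of values without measures mentioned in the text.
for r in (2.0, 3.0):
    without_summer = required_reduction(r)
    with_summer = required_reduction(r - SUMMER_DROP)
    print(f"R = {r}: reduce transmission by {without_summer:.0%}, "
          f"with the summer effect by {with_summer:.0%}")
```

Under these assumptions the summer would lower the required reduction from 50% to about 33% for a reproduction number of 2, and from about 67% to 60% for a reproduction number of 3.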

It is not well understood what exactly makes the summer harder for the flu and even less for SARS-2. One aspect is likely that people are outside more and ventilate buildings more, which dilutes and dries the virus. Also when it comes to schools, it may be an option to do the classes outside, where the distancing rules could be less strict than indoors.

Museums could create large sculpture gardens outside for the summer. As the conference centres are empty and unused they could be used as social distancing museums. The empty hotels could be used to quarantine people who might otherwise infect other people in their households. We have to support the hotels anyway to survive until the pandemic is over.

I have often dreamed of conferences while walking outside in nature. You could transmit the voice of the speaker with a headset. The PowerPoint slides with Comic Sans would be missing. This may be the year to start this as an alternative to video conferences. (Although there would still be transport.)

World Health Organization and Trump

Korinna Hennig:
Could you briefly explain again what the difference is when you say here in the podcast for example: Yes, we have a pandemic in an early phase. And the WHO is still hesitating for a very long time. What is the crucial difference when the WHO makes such an assessment?
Christian Drosten:
So I am only an individual and can give my opinion, which you can follow or not. You can take me for someone who knows what he's doing. Or you can say: He's just a fool and he says things here.

Of course, this has different consequences with the WHO. In the case of a UN organisation, this has certain consequences, not only when it comes to saying that this is a pandemic, but also, and especially, when it comes to saying that this is a PHEIC, i.e. a Public Health Emergency of International Concern. That is a term used in the context of international health regulations. This then also has consequences for intergovernmental organisations. This scope has certainly also led to delays in all these decisions by the WHO.

Of course there are advisory bodies. After all, the WHO is not a person, but an opinion-forming and opinion-collecting organisation. Experts are called together, committees that have to vote at some point and where there is sometimes disagreement. And then they say that we will meet again next week and until then we will observe the situation again. This then leads to decisions that are perceived as a delay by some countries. This is an ex post evaluation of the WHO's behaviour.

At the moment this is again all about politics. And it is about a decision by Donald Trump, who has now said that he is suspending the WHO payments, the contributions, because the WHO did not say certain things early on.

It was, of course, known relatively early on from individual case reports that cases had already been introduced in the USA. And now to say that it is a pandemic that is taking place in all other countries ... So the statement that this is a pandemic is to acknowledge the situation, that this is widespread. This has nothing to do with the assessment for your own country. Since you know it is in your own country, you have to ask yourself: Do I act or not?
Korinna Hennig:
And there are of course financial liabilities between countries that are linked to the WHO.

Local outbreaks in wave 1, everywhere in wave 2

If there is a second wave, it will not look like this first wave.
Christian Drosten:
What happened in the case of the Spanish flu was this: We also had a first wave there in some major US cities - that is very, very well documented - that caught our attention. However, it did not occur in all places, but was distributed extremely unevenly locally. It was conspicuous here and there, and elsewhere people did not even notice that this disease existed at all.

Even there, even at that time, people were already working with curfews and similar things. This was also happening in spring, by the way. Then it went into the summer and apparently there was a strong seasonal effect. And you didn't even notice the disease anymore. And under the cover of this seasonal effect - we can perhaps now envisage this as, under the cover of the social distancing measures that are currently in force - this illness has, however, unnoticed, spread much more evenly geographically.

And then, when the Spanish flu hit a winter wave, the situation was suddenly quite different. Then chains of infection started at the same time in all places because the virus had spread unnoticed everywhere and no one had paid any attention to it. This is of course an effect that will also occur in Germany, because we do not have a complete ban on leaving and travelling here, and of course we do not have zero transmission either, but we have an R, i.e. a reproduction number that is around or sometimes perhaps even slightly below one. But that does not mean that no more is being transmitted.
So you can look at our homepage, for example, at the Institute of Virology at the Charité - we have now published a whole set of [virus] sequences from Germany. You can see that the viruses in Germany are already very much intermixed, that the local clustering is slowly disintegrating and that all viruses can be found in all places. So let me put it very simply. It is slowly but surely becoming very intermixed. ...

We'll be in a different situation when winter sets in. ... Suddenly you'd be surprised that the virus starts everywhere at once. Of course it is a completely different impact that such a wave of infection would have.
What I find interesting to see is that there is nearly no difference in virus activity between cities and rural regions in Germany anymore. If anything, just looking at the map below, I have the impression that rural regions have more virus activity. On the other hand, in the beginning, I feel there was more activity in the cities.

Yesterday's map of the RKI, the German CDC, of the number of new cases over the last week per 100,000 inhabitants. The larger cities are denoted by a small red dot, the location of the smaller cities can sometimes be seen as a smaller region in a different colour. The darkest region is an outbreak, which was likely due to a strong beer festival.

Positive surprises

Christian Drosten:
It is also quite possible that there will be positive surprises. For example, we still know nothing about children. It is even the case that in studies that are very systematically designed, this effect is often still left out. We know from other coronavirus diseases, especially MERS, that not only are children hardly affected, but they are hardly ever infected. Now the question is, of course, whether this is also the case with this disease, that not only do they not get any symptoms and are therefore not so conspicuous in the statistics, but that they are somehow resistant in a certain way and that they do not even have to be counted in the population to be infected. So what is 70% of the population? Is it possible to consider the 20 percent of children as finished, because they do not get infected at all? In reality, only 50 percent of the population need to be infected? This is a big gap, which can also be interpreted as a great hope.

And there is something else - we are anticipating that, epidemiological modellers are doing that, and they are taking that into account: That there may be an unnoticed background immunity from the common cold corona viruses, because they are already related in some way to the SARS-2 virus. It could happen, however, that certain people, because they have had a cold from such a corona virus in the last year or two, are protected in a previously unnoticed way.

All I want to say is that we are currently observing more and more - and a major study has just come out of China in the preprint realm - that in well-observed household situations, the secondary attack rate, that is to say the rate of infected persons who become infected when there is an index case in the household, an infected person, is quite low. It is in the range of 12, 13, 14 percent. Depending on the correction, you can also say that it is perhaps 15, 16, 17 percent. But it does not lie at 50 or 60 percent or higher, where you would then say that these are probably just random effects. The one who didn't get infected wasn't at home during the infectious period or something.

How is it possible that so many people who were supposed to be in the household are not infected? Is there some sort of background immunity involved?

And there are these residual uncertainties. But at this stage, even if you include all these residual uncertainties in these models, you still get the picture that the medical system and the intensive care unit capacity would be overloaded. That is why it is certainly right at the moment to have taken these measures. We must now carry out intensive research work as quickly as possible, as we clarify issues such as: What is really wrong with the children? Do they not get seriously ill, but are they in fact infected and are giving off the virus and carrying it into the family? Or are they resistant in some way? The other question that we absolutely must also answer is: why do relatively few, perhaps even cautiously put, unexpectedly few get infected in the household? This is a realisation that is now maturing so slowly.

As I said, a new preprint has just appeared from China, and a few other studies suggest that this is the case. The Munich case tracking study, for example, has already hinted at this a bit. You have to take a closer look at that. Is there perhaps a hitherto unnoticed background immunity, even if only partial immunity?

That wouldn't mean that we were wrong at this point in time, and what we have done now was wrong. At the moment, even if you factor in these effects, you get the impression that it's right to stop this, that we're not getting into such a rampage that we can no longer control. But for the estimation of how long the whole thing will last, new information could arise from this. It could then be - and I would like to say this now, perhaps as a message of hope - that in a few weeks or months, new information will come out of science that says that the infection activity will probably stop earlier than we thought because of this special effect.

But I don't want to say that I can announce something now. These are not hints from me, or data that have been available for a long time, but that I wouldn't want to say in public or anything. Rather, they are simply fundamental considerations that we simply know too little about this disease at the moment. And that the knowledge, which is actually growing from week to week, will also influence the current projections.
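
The 70% versus 50% question Drosten raises above is a simple subtraction, if one assumes that a resistant group counts fully towards the share of the population that has to be immune. The sketch below only illustrates that arithmetic with the numbers from the podcast. The function name is made up, and whether children are in fact resistant is exactly the open question.

```python
def share_still_to_be_infected(threshold: float, resistant: float) -> float:
    """Share of the population that still has to become immune by infection.

    threshold: share of the population that has to be immune in total.
    resistant: share that is assumed to be resistant already.
    """
    if not (0.0 <= threshold <= 1.0 and 0.0 <= resistant <= 1.0):
        raise ValueError("shares must lie between 0 and 1")
    return max(threshold - resistant, 0.0)


# 70% threshold, with no, half or all of the 20% children being resistant.
for resistant in (0.0, 0.1, 0.2):
    remaining = share_still_to_be_infected(0.70, resistant)
    print(f"resistant share {resistant:.0%}: "
          f"{remaining:.0%} still to be infected")
```

With all 20 percent resistant, 50 percent of the population would remain, which is the gap Drosten describes.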

Other podcasts

Part 31: Corona Virus Update: Don't take stories about reinfected cured patients too seriously.

Part 28: Corona Virus Update: exit strategy, masks, aerosols, loss of smell and taste.

Part 27: Corona Virus Update: tracking infections by App and do go outside

Part 23: Corona Virus Update: need for speed in funding and publication, virus arrival, from pandemic to endemic

Part 22: Corona Virus Update: scientific studies on cures for COVID-19.

Part 21: Corona Virus Update: tests, tests, tests and how they work.

Part 20: Corona Virus Update: Case-tracking teams, slowdown in Germany, infectiousness.

Part 19: Corona Virus Update with Christian Drosten: going outside, face masks, children and media troubles.

Part 18: Leading German virologist Prof. Dr. Christian Drosten goes viral, topics: Air pollution, data quality, sequencing, immunity, seasonality & curfews.

Related reading

This Corona Virus Update podcast and its German transcript. Part 32.

All podcasts and German transcripts of the Corona Virus Update.