
Sunday, 1 May 2016

Christy and McNider: Time Series Construction of Summer Surface Temperatures for Alabama

John Christy and Richard McNider have a new paper in the AMS Journal of Applied Meteorology and Climatology called "Time Series Construction of Summer Surface Temperatures for Alabama, 1883–2014, and Comparisons with Tropospheric Temperature and Climate Model Simulations". Link: Christy and McNider (2016).

This post gives just a few quick notes on the methodological aspects of the paper.
1. They select data with a weak climatic temperature trend.
2. They select data with a large cooling bias due to improvements in radiation protection of thermometers.
3. They developed a new homogenization method using an outdated design and did not test it.

Weak climatic trend

Christy and McNider wrote: "This is important because the tropospheric layer represents a region where responses to forcing (i.e., enhanced greenhouse concentrations) should be most easily detected relative to the natural background."

The trend in the troposphere should be a few percent stronger than at the surface, mainly in the tropics. What is interesting, however, is that they see a strong trend as a reason to prefer tropospheric temperatures, while for the surface they select the period and variable with the smallest temperature trend: the daily maximum temperatures in summer.

The trend in winter due to global warming should be 1.5 times the trend in summer and the trend in the night time minimum temperatures is stronger than the trend in the day time maximum temperatures, as discussed here. Thus Christy and McNider select the data with the smallest trend for the surface. Using their reasoning for the tropospheric temperatures they should prefer night time winter temperatures.

(And their claim about the tropospheric temperatures is not right, because whether a trend can be detected depends not only on the signal, but also on the noise. The weather noise due to El Nino is much stronger in the troposphere and the instrumental uncertainties are also much larger. Thus the signal-to-noise ratio is smaller for the tropospheric temperatures, even if the tropospheric record were as long as the surface record.

Furthermore, I am somewhat amused that there are still people interested in the question whether global warming can be detected.)

[UPDATE. Tamino shows that within the USA, Alabama happens to be the region with the least warming. The more so for the maximum temperature. The more so for the summer temperature.]

Cooling bias

Then they used data with a very large cooling bias due to improvements in the protection of the thermometer against (solar and infra-red) radiation. Early thermometers were not protected as well against solar radiation and typically recorded too high temperatures. Early thermometers also recorded too cool minimum temperatures; the thermometer should not see the cold sky, otherwise it radiates out to it and cools. The warming bias in the maximum temperature is larger than the cooling bias in the minimum temperature; thus the mean temperature still has some bias, but less than the maximum temperature.

Due to this reduction in the radiation error summer temperatures have a stronger cooling bias than winter temperatures.

The warming effect of early measurements on the annual means is probably about 0.2 to 0.3°C. In the maximum temperature it will be a lot higher, and in the summer maximum temperature higher still.

That is why most climatologists use the annual means. Homogenization can improve climate data, but it cannot remove all biases. Thus it is good to start with the data that has the least bias, rather than with a highly biased dataset, as Christy and McNider did.

Statistical homogenization removes biases by comparing a candidate station to its neighbours. The stations need to be close enough together that the regional climate can be assumed to be similar at both stations. The difference between two stations then consists of weather noise and inhomogeneities (non-climatic changes due to changes in the way temperature was measured).

If you want to be able to see the inhomogeneities, you thus need well-correlated neighbours and as little weather noise as possible. By using only the maximum temperature, rather than the mean temperature, you increase the weather noise. By using the monthly means in summer, rather than the annual means or at the very least the summer means, you increase the weather noise. By going back in time more than a century you increase the noise further, because there were fewer stations to compare with at the time.
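To illustrate the principle, here is a minimal sketch (with made-up numbers, not the setup of the paper): a candidate and a neighbour series share a regional climate signal, and taking their difference cancels that shared signal, so an inserted break stands out against the remaining weather noise.

```python
# Toy sketch of relative homogenization: the difference between a candidate
# and a well-correlated neighbour removes the shared regional climate signal,
# leaving weather noise plus any non-climatic jumps. Numbers are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n_years = 100
climate = np.cumsum(rng.normal(0.01, 0.1, n_years))  # shared regional climate signal

candidate = climate + rng.normal(0, 0.3, n_years)     # local weather noise
neighbour = climate + rng.normal(0, 0.3, n_years)
candidate[60:] += 0.8                                 # non-climatic break (e.g. relocation)

diff = candidate - neighbour                          # shared climate signal cancels
print("std of candidate series  :", candidate.std().round(2))
print("std of difference series :", diff.std().round(2))
print("mean difference before/after the break:",
      diff[:60].mean().round(2), diff[60:].mean().round(2))
```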

They keyed part of the data themselves, mainly for the period before 1900, from the paper records. It sounds as if they performed no quality control on these values (to detect measurement errors). This will also increase the noise.

With such a low signal-to-noise ratio (inhomogeneities that are small relative to the weather noise in the difference time series), the estimated dates of the breaks they did find will have a large uncertainty. It is thus a pity that they purposefully did not use information from station histories (metadata) to get the dates of the breaks right.

Homogenization method

They developed their own homogenization method and only tested it on a noise signal with one break in the middle. Real series have multiple breaks, in the USA typically one every 15 years. Furthermore, the reference series also has breaks.

The method uses the detection equation from the Standard Normal Homogeneity Test (SNHT), but then uses different significance levels. Furthermore, for some reason it does not use the hierarchical splitting of SNHT to deal with multiple breaks, but instead detects on a window in which only one break is assumed. However, if you make the window too long it will contain more than one break, and if you make it too short the method has no detection power. You would thus theoretically expect detection on a window to perform very badly, and this is also what we found in a numerical validation study.
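For reference, here is a minimal sketch of the classical single-break SNHT statistic applied to a difference series. This is the textbook version, not the windowed variant of the paper; the break position, break size and noise level are made up for illustration.

```python
# Sketch of the classical SNHT detection statistic on a standardized
# difference series: T(k) = k*mean(z[:k])^2 + (n-k)*mean(z[k:])^2,
# maximized over the candidate break position k.
import numpy as np

def snht_statistic(diff):
    """Return the SNHT statistic T(k) for every candidate break position k."""
    z = (diff - diff.mean()) / diff.std(ddof=1)       # standardized difference series
    n = len(z)
    T = np.full(n, np.nan)
    for k in range(1, n):                             # split into [0, k) and [k, n)
        z1, z2 = z[:k].mean(), z[k:].mean()
        T[k] = k * z1**2 + (n - k) * z2**2
    return T

rng = np.random.default_rng(1)
diff = rng.normal(0, 0.3, 100)
diff[60:] += 0.8                                      # inserted break
T = snht_statistic(diff)
print("most likely break position:", int(np.nanargmax(T)))   # close to 60
print("maximum T value           :", round(float(np.nanmax(T)), 1))
```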

I see no real excuse not to use better homogenization methods (ACMANT, PRODIGE, HOMER, MASH, Craddock). These are built to take into account that the reference station also has breaks and that a series will have multiple breaks; there is no need for ad hoc windows.

If you design your own homogenization method, it is good scientific practice to test it first, to study whether it does what you hope it does. There is, for example, the validation dataset of the COST Action HOME. Using that immediately allows you to compare your skill to the other methods. Given the outdated design principles, I am not hopeful the Christy and McNider homogenization method would score above average.

Conclusions

These are my first impressions on the homogenization method used. Unfortunately I do not have the time at the moment to comment on the non-methodological parts of the paper.

If there are no knowledgeable reviewers available in the USA, it would be nice if the AMS asked European researchers, rather than some old professor who once, in the 1960s, removed an inhomogeneity from his dataset. Homogenization is a specialization; it is not trivial to make data better, and it really would not hurt if the AMS asked for expertise from Europe when American experts are busy.

Hitler is gone. The EGU general assembly has a session on homogenization, the AGU does not. The EMS has a session on homogenization, the AMS does not. EUMETNET organizes data management workshops, a large part of which is about homogenization; I do not know of an American equivalent. And we naturally have the Budapest seminars on homogenization and quality control. Not Budapest, Georgia, nor Budapest, Missouri, but Budapest, Hungary, Europe.



Related reading

Tamino: Cooling America. Alabama compared to the rest of contiguous USA.

HotWhopper discusses further aspects of this paper and some differences between the paper and the press release. Why nights can warm faster than days - Christy & McNider vs Davy 2016

Early global warming

Statistical homogenisation for dummies

Friday, 27 June 2014

Self-review of problems with the HOME validation study for homogenization methods

In my last post, I argued that post-publication review is no substitute for pre-publication review, but it could be a nice addition.

This post is a post-publication self-review, a review of our paper on the validation of statistical homogenization methods, also called benchmarking when it is a community effort. Since writing this benchmarking article we have come to understand the problem better and have found some weaknesses. I have explained these problems at conferences, but for those who did not hear them, please find them below after a short introduction. We have a new paper in open review that explains how we want to do better in the next benchmarking study.

Benchmarking homogenization methods

In our benchmarking paper we generated a dataset that mimicked real temperature or precipitation data. To this data we added non-climatic changes (inhomogeneities). We then asked climatologists to homogenize this data, that is, to remove the inhomogeneities we had inserted. How good a homogenization algorithm is can be seen by comparing the homogenized data to the original homogeneous data.

This is straightforward science, but the realism of the dataset was the best to date, and because this project was part of a large research program (the COST Action HOME) we had a large number of contributions. Mathematical understanding of the algorithms is also important, but homogenization algorithms are complicated methods and it is easy to make errors in their implementation, so such numerical validations are valuable as well. The two approaches complement each other.
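A toy version of this validation logic looks like the sketch below; the deliberately naive correction step only stands in for the real homogenization algorithms, and all numbers are made up.

```python
# Toy version of the benchmarking logic: create homogeneous data, add a known
# inhomogeneity, "homogenize", and score against the withheld truth.
import numpy as np

rng = np.random.default_rng(2)
truth = rng.normal(0, 0.5, 100)          # homogeneous "true" series (withheld)
raw = truth.copy()
raw[40:] += 0.6                          # known, inserted inhomogeneity

def naive_homogenize(series, break_pos):
    """Stand-in 'algorithm': shift the later segment onto the earlier mean."""
    out = series.copy()
    out[break_pos:] -= out[break_pos:].mean() - out[:break_pos].mean()
    return out

homogenized = naive_homogenize(raw, 40)

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

print("RMSE before homogenization:", round(rmse(raw, truth), 2))
print("RMSE after homogenization :", round(rmse(homogenized, truth), 2))
```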


Group photo at a meeting of the COST Action HOME with most of the European homogenization community present. These are those people working in ivory towers, eating caviar from silver plates, drinking 1985 Romanee-Conti Grand Cru from crystal glasses and living in mansions. Enjoying the good life on the public teat, while conspiring against humanity.

The main conclusions were that homogenization improves the homogeneity of temperature data. Precipitation is more difficult and only the best algorithms were able to improve it. We found that modern methods improved the quality of temperature data about twice as much as traditional methods. It is thus important that people switch to one of these modern methods. My impression from the recent Homogenisation seminar and the upcoming European Meteorological Society (EMS) meeting is that this seems to be happening.

1. Missing homogenization methods

An impressive number of methods participated in HOME. Many manual methods were also applied, which are validated less often because doing so is more work. All the state-of-the-art methods participated, as well as most of the widely used methods. However, we forgot to test a two- or multi-phase regression method, which is popular in North America.

Also not validated is HOMER, the algorithm that was designed afterwards using the best parts of the tested algorithms. We are working on this. Many people have started using HOMER. Its validation should thus be a high priority for the community.

2. Size breaks (random walk or noise)

Next to the benchmark data with the inserted inhomogeneities, we also asked people to homogenize some real datasets. This turned out to be very important because it allowed us to validate how realistic the benchmark data is, information we need to make future studies more realistic. In this validation we found that the benchmark inhomogeneities were larger than those in the real data. Expressed as the standard deviation of the break size distribution, the benchmark breaks were typically 0.8°C, while the real breaks were only about 0.6°C.

This was already reported in the paper, but we now understand why. In the benchmark, the inhomogeneities were implemented by drawing a random number for every homogeneous period and perturbing the original data by this amount. In other words, we added noise to the homogeneous data. However, the homogenizers who requested breaks with a size of about 0.8°C were thinking of the difference from one homogeneous period to the next. The size of such a jump is influenced by two random numbers. Because variances add, the jumps implemented as noise were a factor of the square root of two (about 1.4) too large.
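A quick numerical check of this variance argument (the 0.8°C value is the one mentioned above; everything else is illustrative):

```python
# If each homogeneous period is perturbed by independent noise with standard
# deviation sigma, the jump from one period to the next has standard
# deviation sqrt(2) * sigma, because the variances of the two offsets add.
import numpy as np

rng = np.random.default_rng(3)
sigma = 0.8                                      # intended break size (deg C)
offsets = rng.normal(0, sigma, 100_000)          # one perturbation per homogeneous period
jumps = np.diff(offsets)                         # break size from one period to the next

print("std of the perturbations:", round(float(offsets.std()), 2))   # ~0.8
print("std of the jumps        :", round(float(jumps.std()), 2))     # ~0.8 * sqrt(2) = 1.13
```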

The validation showed that, except for the size, the idea of implementing the inhomogeneities as noise was a good approximation. The alternative would be to draw a random number and use it to perturb the data relative to the previously perturbed period; in that case you implement the inhomogeneities as a random walk. Nobody thought of reporting it, but it seems that most validation studies have implemented their inhomogeneities as random walks. This makes the influence of the inhomogeneities on the trend much larger. Because of the larger error, it is probably easier to achieve relative improvements, but because the initial errors were larger in absolute terms, the absolute errors after homogenization may well have been too large in previous studies.

You can see the difference between a noise perturbation and a random walk by comparing the signs (up or down) of consecutive breaks. For example, in the case of noise and a large upward jump, the next change is likely to make the perturbation smaller again. In the case of a random walk, the size and sign of the previous break are irrelevant: the probability of either sign is one half.

In other words, in the case of a random walk there are just as many up-down and down-up pairs as up-up and down-down pairs; every combination has a probability of one in four. In the case of noise perturbations, up-down and down-up pairs (platform-like break pairs) are more likely than up-up and down-down pairs. The latter is what we found in the real datasets, although there is a small deviation that suggests a small random-walk contribution; that may also be because the inhomogeneities cause a trend bias.
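The sign diagnostic can be illustrated with a small simulation (a sketch with arbitrary break sizes, not the HOME analysis itself): for noise-like perturbations about two thirds of consecutive break pairs alternate in sign, while for a random walk it is one half.

```python
# Sign diagnostic for the two break models: count how often consecutive
# breaks have opposite signs (platform-like up-down or down-up pairs).
import numpy as np

rng = np.random.default_rng(4)
n = 100_000

noise_offsets = rng.normal(0, 1, n)        # noise model: one perturbation per period
noise_jumps = np.diff(noise_offsets)       # breaks are differences of the offsets

walk_jumps = rng.normal(0, 1, n - 1)       # random-walk model: independent jumps

def fraction_alternating(jumps):
    """Fraction of consecutive break pairs with opposite signs."""
    s = np.sign(jumps)
    return float(np.mean(s[:-1] * s[1:] < 0))

print("noise model      :", round(fraction_alternating(noise_jumps), 2))  # ~0.67
print("random-walk model:", round(fraction_alternating(walk_jumps), 2))   # ~0.50
```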

3. Signal to noise ratio varies regionally

The HOME benchmark reproduced a typical situation in Europe (the USA is similar). However, the station density in much of the world is lower. Inhomogeneities are detected and corrected by comparing a candidate station to neighbouring ones. When the station density is lower, this difference signal is noisier and homogenization becomes more difficult. One would thus expect the performance of homogenization methods to be lower in other regions, although the break frequency and break size may also differ there.

Thus, to estimate how large the influence of the remaining inhomogeneities on the global mean temperature can be, we need to study the performance of homogenization algorithms in a wider range of situations. Also for the intercomparison of homogenization methods (the more limited aim of HOME), the signal (break size) to noise ratio is important. Domonkos (2013) showed that the ranking of various algorithms depends on the signal-to-noise ratio. Ralf Lindau and I have just submitted a manuscript showing that for low signal-to-noise ratios the multiple-breakpoint method PRODIGE is not much better at detecting breaks than a method that would "detect" random breaks, while it works fine for higher signal-to-noise ratios. Other methods may also be affected, though possibly not to the same extent. More on that later.
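The dependence on station density follows from a standard statistical relation (not tied to any particular homogenization method): for two stations with weather-noise level sigma and inter-station correlation rho, the difference series has a standard deviation of sqrt(2(1-rho))*sigma, so the signal-to-noise ratio drops quickly for poorly correlated, distant neighbours. A small illustration with made-up numbers:

```python
# Noise of a difference series as a function of inter-station correlation:
# Var(x1 - x2) = 2 * sigma^2 * (1 - rho) for two stations with equal variance.
import numpy as np

sigma = 1.0                                   # weather noise of a single station (illustrative)
break_size = 0.6                              # typical real break size from the comparison above

for rho in (0.95, 0.80, 0.50):                # dense ... sparse network
    sigma_diff = np.sqrt(2 * (1 - rho)) * sigma
    print(f"rho = {rho:.2f}: noise of difference series = {sigma_diff:.2f}, "
          f"signal-to-noise ratio = {break_size / sigma_diff:.1f}")
```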

4. Regional trends (absolute homogenization)

The initially simulated data did not have a trend, so we explicitly added a trend to all stations to give the data a regional climate change signal. This trend could be either upward or downward, just to check whether homogenization methods have problems with downward trends, which are not typical of daily operations. They do not.

Had we inserted a simple linear trend in the HOME benchmark data, the operators of the manual homogenization methods could in theory have used this information to improve their performance: if the trend is not linear, there are apparently still inhomogeneities in the data. We wanted to keep the operators in the dark. Consequently, we inserted a rather complicated, variable nonlinear trend in the dataset.

As already noted in the paper, this may have handicapped the participating absolute homogenization method. Homogenization methods used in climate are normally relative ones: they compare a station to its neighbours, which share the same regional climate signal, so that signal is removed and does not matter. Absolute methods do not use the information from the neighbours; they have to make assumptions about the variability of the real regional climate signal. Absolute methods have problems with gradual inhomogeneities, are less sensitive, and are therefore not used much.

If absolute methods are participating in future studies, the trend should be modelled more realistically. When benchmarking only automatic homogenization methods (no operator) an easier trend should be no problem.

5. Length of the series

The station networks simulated in HOME were all one century long; some of the stations were shorter because we also simulated the build-up of the network during the first 25 years. We recently found that the criterion for the optimal number of break inhomogeneities used by one of the best homogenization methods (PRODIGE) does not have the right dependence on the number of data points (Lindau and Venema, 2013). For climate datasets that are about a century long the criterion is quite good, but for much longer or shorter datasets there are deviations. This illustrates that the length of the datasets matters and that it is important for benchmarking that the data availability is the same as in real datasets.

Another reason why it is important for the benchmark data availability to be the same as in the real dataset is that this makes the comparison of the inhomogeneities found in the real data and in the benchmark more straightforward. This comparison is important to make future validation studies more accurate.

6. Non-climatic trend bias

The inhomogeneities we inserted in HOME were on average zero. For individual stations this still results in clear non-climatic trend errors, because each station averages over only a small number of inhomogeneities. For the full networks the number of inhomogeneities is larger and the non-climatic trend error therefore very small. Consequently, it was very hard for the homogenization methods to improve these small errors. In real raw datasets a larger non-climatic error is expected. Globally the non-climatic trend will be relatively small, but within one network, where the stations experienced similar (technological and organisational) changes, it can be appreciable. Thus we should model such a non-climatic trend bias explicitly in future.
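The averaging argument can be made concrete with a tiny simulation (numbers are purely illustrative, not HOME values): individual stations can have sizeable random non-climatic trend errors, yet the network mean shrinks roughly with the square root of the number of stations when the errors average to zero.

```python
# Zero-mean station trend errors largely cancel in the network mean.
import numpy as np

rng = np.random.default_rng(5)
n_stations = 100
# zero-mean non-climatic trend error per station (deg C per century), illustrative spread
station_errors = rng.normal(0, 0.3, n_stations)

print("typical single-station error (std):", round(float(station_errors.std()), 2))
print("network-mean error                :", round(float(station_errors.mean()), 2))
print("expected spread of the network mean:", round(0.3 / np.sqrt(n_stations), 2))
```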

International Surface Temperature Initiative

The last five problems will be addressed in the International Surface Temperature Initiative (ISTI) benchmark. Whether a two-phase homogenization method will participate is beyond our control. We expect fewer participants than in HOME, because for such a huge global dataset the homogenization methods will need to run automatically and unsupervised.

The standard break sizes will be made smaller. We will generate ten benchmarking "worlds" with different kinds of inserted inhomogeneities and will also vary the size and number of the inhomogeneities. Because the ISTI benchmarks will mirror the real data holdings of the ISTI, the station density and the length of the data will be the same. The regional climate signal will be derived from a global circulation model, so absolute methods could also participate. Finally, we will introduce a clear non-climatic trend bias in several of the benchmark "worlds".

The paper on the ISTI benchmark is open for discussions at the journal Geoscientific Instrumentation, Methods and Data Systems. Please find the abstract below.

Abstract.
The International Surface Temperature Initiative (ISTI) is striving towards substantively improving our ability to robustly understand historical land surface air temperature change at all scales. A key recently completed first step has been collating all available records into a comprehensive open access, traceable and version-controlled databank. The crucial next step is to maximise the value of the collated data through a robust international framework of benchmarking and assessment for product intercomparison and uncertainty estimation. We focus on uncertainties arising from the presence of inhomogeneities in monthly surface temperature data and the varied methodological choices made by various groups in building homogeneous temperature products. The central facet of the benchmarking process is the creation of global scale synthetic analogs to the real-world database where both the "true" series and inhomogeneities are known (a luxury the real world data do not afford us). Hence algorithmic strengths and weaknesses can be meaningfully quantified and conditional inferences made about the real-world climate system. Here we discuss the necessary framework for developing an international homogenisation benchmarking system on the global scale for monthly mean temperatures. The value of this framework is critically dependent upon the number of groups taking part and so we strongly advocate involvement in the benchmarking exercise from as many data analyst groups as possible to make the best use of this substantial effort.


Related reading

Nick Stokes made a beautiful visualization of the raw temperature data in the ISTI database. Homogenized data, in which non-climatic trends have been removed, is unfortunately not yet available; it will be released together with the results of the benchmark.

New article: Benchmarking homogenisation algorithms for monthly data. The post describing the HOME benchmarking article.

New article on the multiple breakpoint problem in homogenization. Most work in statistics is about data with just one break inhomogeneity (change point). In climate there are typically more breaks. Methods designed for multiple breakpoints are more accurate.

Part 1 of a series on Five statistically interesting problems in homogenization.


References

Domonkos, P., 2013: Efficiencies of Inhomogeneity-Detection Algorithms: Comparison of Different Detection Methods and Efficiency Measures. Journal of Climatology, Art. ID 390945, doi: 10.1155/2013/390945.

Lindau and Venema, 2013: On the multiple breakpoint problem and the number of significant breaks in homogenization of climate records. Idojaras, Quarterly Journal of the Hungarian Meteorological Service, 117, No. 1, pp. 1-34. See also my post: New article on the multiple breakpoint problem in homogenization.

Lindau and Venema, to be submitted, 2014: The joint influence of break and noise variance on the break detection capability in time series homogenization.

Willett, K., Williams, C., Jolliffe, I., Lund, R., Alexander, L., Brönniman, S., Vincent, L. A., Easterbrook, S., Venema, V., Berry, D., Warren, R., Lopardo, G., Auchmann, R., Aguilar, E., Menne, M., Gallagher, C., Hausfather, Z., Thorarinsdottir, T., and Thorne, P. W.: Concepts for benchmarking of homogenisation algorithm performance on the global scale, Geosci. Instrum. Method. Data Syst. Discuss., 4, 235-270, doi: 10.5194/gid-4-235-2014, 2014.

Monday, 30 September 2013

Reviews of the IPCC review

The first IPCC report (Working Group One), "Climate Change 2013, the physical science basis", has just been released.

One way to judge the reliability of a source is to see what it states about a topic you are knowledgeable about. I work on the homogenization of station climate data and was thus interested in how well the IPCC report presents the scientific state of the art on the uncertainties in trend estimates due to historical changes in climate monitoring practices.

Furthermore, I have asked some colleague climate science bloggers to review the IPCC report on their areas of expertise. You will find these reviews of the IPCC review report at the end of this post as they come in. I found most of these colleagues via the beautiful list of climate science bloggers of Doug McNeall.

Large-Scale Records and their Uncertainties

The IPCC report is nicely structured. The part that deals with the quality of the land surface temperature observations is in Chapter 2 Observations: Atmosphere and Surface, Section 2.4 Changes in Temperature, Subsection 2.4.1 Land-Surface Air Temperature, Subsubsection 2.4.1.1 Large-Scale Records and their Uncertainties.

The relevant paragraph reads (my paragraph breaks for easier reading):
Particular controversy since AR4 [the last fourth IPCC report, vv] has surrounded the LSAT [land surface air temperature, vv] record over the United States, focussed upon siting quality of stations in the US Historical Climatology Network (USHCN) and implications for long-term trends. Most sites exhibit poor current siting as assessed against official WMO [World Meteorological Organisation, vv] siting guidance, and may be expected to suffer potentially large siting-induced absolute biases (Fall et al., 2011).

However, overall biases for the network since the 1980s are likely dominated by instrument type (since replacement of Stevenson screens with maximum minimum temperature systems (MMTS) in the 1980s at the majority of sites), rather than siting biases (Menne et al., 2010; Williams et al., 2012).

A new automated homogeneity assessment approach (also used in GHCNv3, Menne and Williams, 2009) was developed that has been shown to perform as well or better than other contemporary approaches (Venema et al., 2012). This homogenization procedure likely removes much of the bias related to the network-wide changes in the 1980s (Menne et al., 2010; Fall et al., 2011; Williams et al., 2012).

Williams et al. (2012) produced an ensemble of dataset realisations using perturbed settings of this procedure and concluded through assessment against plausible test cases that there existed a propensity to under-estimate adjustments. This propensity is critically dependent upon the (unknown) nature of the inhomogeneities in the raw data records.

Their homogenization increases both minimum temperature and maximum temperature centennial-timescale United States average LSAT trends. Since 1979 these adjusted data agree with a range of reanalysis products whereas the raw records do not (Fall et al., 2010; Vose et al., 2012a).

I would argue that this is a fair summary of the state of the scientific literature. That naturally does not mean that all statements are true, just that it fits the current scientific understanding of the quality of the temperature observations over land. People claiming that there are large trend biases in the temperature observations will need to explain what is wrong with Venema et al. (an article of mine from 2012) and especially Williams et al. (2012). Williams et al. (2012) provides strong evidence that if there is a bias in the raw observational data, homogenization can improve the trend estimate, but it will normally not remove the bias fully.

Personally, I would be very surprised if someone would find substantial trend biases in the homogenized US American temperature observations. Due to the high station density, this dataset can be investigated and homogenized very well.

Sunday, 29 July 2012

Blog review of the Watts et al. (2012) manuscript on surface temperature trends

[UPDATE: Skeptical Science has written an extensive review of the Watts et al. manuscript: "As it currently stands, the issues we discuss below appear to entirely compromise the conclusions of the paper." They mention all the important issues, except maybe for the selection bias mentioned below. Thus my fast preliminary review below can now be considered outdated. Have fun.]

Anthony Watts put his blog on hold for two days because he had to work on an urgent project.
Something’s happened. From now until Sunday July 29th, around Noon PST, WUWT will be suspending publishing. At that time, there will be a major announcement that I’m sure will attract a broad global interest due to its controversial and unprecedented nature.
What has happened? Anthony Watts, President of IntelliWeather, has co-written a manuscript and a press release! As Mr. Watts is a fan of review by bloggers, here is my first reaction after looking through the figures and the abstract.

Tuesday, 17 July 2012

Investigation of methods for hydroclimatic data homogenization

The self-proclaimed climate sceptics have found an interesting presentation held at the General Assembly of the European Geosciences Union.

In the words of Anthony Watts, the "sceptic" with one of the most read blogs, this abstract is a ”new peer reviewed paper recently presented at the European Geosciences Union meeting.” A bit closer to the truth is that this is a conference contribution by Steirou and Koutsoyiannis, based on a graduation thesis (in Greek), which was submitted to the EGU session "Climate, Hydrology and Water Infrastructure". An EGU abstract is typically half a page; it is not possible to do a real review of a scientific study based on such a short text. The purpose of an EGU abstract is in practice to decide who gets a talk and who gets a poster, nothing more; everyone is welcome to come to EGU.