The new dataset is introduced by Zeke Hausfather and Robert Rohde on Real Climate:
Daily temperature data is an important tool to help measure changes in extremes like heat waves and cold spells. To date, only raw quality controlled (but not homogenized) daily temperature data has been available through GHCN-Daily and similar sources. Using this data is problematic when looking at long-term trends, as localized biases like station moves, time of observation changes, and instrument changes can introduce significant biases.
For example, if you were studying the history of extreme heat in Chicago, you would find a slew of days in the late 1930s and early 1940s where the station currently at the Chicago O’Hare airport reported daily max temperatures above 45 degrees C (113 F). It turns out that, prior to the airport’s construction, the station now associated with the airport was on the top of a black roofed building closer to the city. This is a common occurrence for stations in the U.S., where many stations were moved from city cores to newly constructed airports or wastewater treatment plants in the 1940s. Using the raw data without correcting for these sorts of bias would not be particularly helpful in understanding changes in extremes.
The post explains in more detail how the BEST daily method works and presents some beautiful visualizations and videos of the data. Worth reading in detail.
Daily homogenization
If I understand the homogenization procedure of BEST correctly, it is based on their methods for the monthly mean temperature and therefore only accounts for non-climatic changes (inhomogeneities) in the mean temperature.
The move from a black roof in a city to an airport is also a good example of how more than the mean can change. The black roof will show more variability, because the warm bias is larger on hot sunny days than on windy cloudy days. Thus part of the variability at such a station is variability in insolation and wind.
The urban heat island (UHI) could also be a source of variability: the UHI is strongest on calm, cloud-free days. Thus part of the variability in observed temperature will be due to variability in wind and clouds.
A nice illustration of the problem can be found in a recent article by Blair Trewin. He compares the temperature distributions of two stations, one in a city near the coast and one at an airport more inland. In the past the station was in the city; nowadays it is at the airport. The modern measurements in the city that are shown below have been made to study the influence of this change.
For this plot he computed the 0th to the 100th percentiles. The 50th percentile is the median: 50% of the data has a lower value. The 10th percentile is the value below which 10% of the data lies, and so on. The 0th and 100th percentiles in this plot are the minimum and maximum. What is displayed is the temperature difference between the two stations at each percentile. On average the difference is about 2°C; the airport is warmer. However, for the higher percentiles (around the 95th) the difference is much larger. Trewin explains this by the cooling of the city station by a land-sea circulation (sea breeze), which is often seen on hot summer days. For the highest percentiles (99th), the difference becomes smaller again because offshore winds override the sea breeze.
Clearly, if you homogenized this time series for the transition from the coast to the inland site by correcting only the mean, you would still have a large inhomogeneity in the higher percentiles, which would still lead to spurious non-climatic trends in hot weather.
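The percentile comparison described above can be sketched in a few lines of Python. This is only an illustration on synthetic data: the two series, their means and spreads, and the function name `percentile_differences` are my assumptions, not Trewin's data or code. The synthetic airport series is simply made warmer and more variable than the city series, so it does not reproduce the sea-breeze effect at the highest percentiles.

```python
import numpy as np

def percentile_differences(site_a, site_b, percentiles=np.arange(0, 101)):
    """Difference (site_a minus site_b) between the percentiles of two series."""
    a = np.percentile(site_a, percentiles)
    b = np.percentile(site_b, percentiles)
    return a - b

# Synthetic daily maximum temperatures (assumed values, for illustration only).
rng = np.random.default_rng(42)
city = rng.normal(25.0, 4.0, size=5000)
# Hypothetical airport site: about 2 °C warmer on average and more variable.
airport = 27.0 + 1.3 * (city - 25.0) + rng.normal(0.0, 0.5, size=5000)

diff = percentile_differences(airport, city)
print(f"median difference: {diff[50]:.1f} °C")
print(f"95th percentile difference: {diff[95]:.1f} °C")
```

With parallel measurements from a real station pair, the same function would give the kind of curve shown in such a percentile plot.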
Thus we would need a bias correction of the complete probability distribution and not just its mean.
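To make the difference between the two kinds of correction concrete, here is a minimal sketch that compares a mean-only adjustment with a generic quantile-matching adjustment. It is not the BEST algorithm and not the method of any specific paper. The function names, the synthetic series and the assumption that parallel measurements from both sites are available are all mine.

```python
import numpy as np

def mean_adjust(old, old_ref, new_ref):
    """Shift the old series by the mean difference between the two sites."""
    return old + (np.mean(new_ref) - np.mean(old_ref))

def quantile_adjust(old, old_ref, new_ref, n_quantiles=99):
    """Add a correction that depends on where a value sits in the old distribution."""
    q = np.linspace(1, 99, n_quantiles)
    old_q = np.percentile(old_ref, q)
    new_q = np.percentile(new_ref, q)
    # Interpolate the correction between percentiles; beyond the 1st and 99th
    # percentile np.interp holds the correction constant.
    return old + np.interp(old, old_q, new_q - old_q)

# Synthetic example (assumed numbers): parallel measurements at both sites.
rng = np.random.default_rng(1)
old_ref = rng.normal(25.0, 4.0, size=5000)
new_ref = 27.0 + 1.3 * (old_ref - 25.0) + rng.normal(0.0, 0.5, size=5000)
# Historic measurements at the old site that we want to adjust to the new site.
historic = rng.normal(25.0, 4.0, size=5000)

target_p95 = np.percentile(new_ref, 95)
err_mean = abs(np.percentile(mean_adjust(historic, old_ref, new_ref), 95) - target_p95)
err_quant = abs(np.percentile(quantile_adjust(historic, old_ref, new_ref), 95) - target_p95)
print(f"error in 95th percentile, mean-only correction: {err_mean:.2f} °C")
print(f"error in 95th percentile, quantile correction:  {err_quant:.2f} °C")
```

In this toy setup the mean-only correction leaves the upper tail too cool, while the quantile correction brings the 95th percentile close to that of the new site.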
Or we should homogenize the indices we are interested in directly, for example percentiles or the number of days above 40°C. The BEST algorithm, being fully automatic, could be well suited for such an approach.
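As a small sketch of what such indices look like in practice, the function below turns a daily maximum temperature series into two annual index series, which could then be homogenized like any other annual series. The function name `annual_indices` and the default threshold and percentile are my choices for illustration.

```python
import numpy as np

def annual_indices(years, tmax, threshold=40.0, percentile=95):
    """Per year: number of days with tmax above the threshold, and a percentile of tmax."""
    years = np.asarray(years)
    tmax = np.asarray(tmax, dtype=float)
    unique_years = np.unique(years)
    hot_days = np.array([(tmax[years == y] > threshold).sum() for y in unique_years])
    pct = np.array([np.percentile(tmax[years == y], percentile) for y in unique_years])
    return unique_years, hot_days, pct

# Tiny made-up example: one entry per day, with the year each day belongs to.
yrs, hot, p95 = annual_indices(
    years=[2000, 2000, 2000, 2001, 2001],
    tmax=[39.0, 41.0, 42.0, 40.0, 43.0],
)
```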
Gridding and kriging
Another problem I see is the use of the interpolation method kriging to bring the data to a regular grid. The number of stations available to estimate the daily mean of a grid box determines its uncertainty and thus also how much this value fluctuates. It will be hard to distinguish changes in weather variability from changes in the error of this estimate due to changes in the configuration of the station network.
This problem can go both ways. If you have many stations in a grid box, additional stations reduce the uncertainty in the estimate of the grid box mean. An increase in the number of stations would then lead to a spurious decrease in variability and fewer extremes.
If there are fewer stations than grid boxes, the method performs an interpolation. Interpolation smooths a field. An empty grid box is estimated as the mean of many faraway surrounding stations, which gives quite smooth values. When a new station appears in this grid box, the grid box mean will for a large part be determined by this relatively noisy single measurement. This would thus give a spurious increase in variability.
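The first of these two effects can be illustrated with a toy simulation. Note that this uses a simple average of the stations in one grid box, not kriging, and that the split into a shared regional signal plus independent local noise, as well as all the numbers, are assumptions made for the example.

```python
import numpy as np

def gridbox_mean_std(n_stations, n_days=20000, regional_std=3.0, local_std=2.0, seed=0):
    """Day-to-day standard deviation of a grid box mean estimated from n_stations."""
    rng = np.random.default_rng(seed)
    # Weather signal shared by all stations in the grid box.
    regional = rng.normal(0.0, regional_std, size=n_days)
    # Independent local variability and measurement noise at each station.
    local = rng.normal(0.0, local_std, size=(n_days, n_stations))
    stations = regional[:, None] + local
    return stations.mean(axis=1).std()

for n in (1, 2, 5, 10):
    print(f"{n:2d} stations: std of grid box mean = {gridbox_mean_std(n):.2f} °C")
```

Under these assumptions the variance of the estimate is regional_std² + local_std² / n_stations, so the day-to-day variability of the grid box value drops as stations are added, even though the weather itself has not changed.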
The number of stations varies considerably in time; see figure below. Thus this could be a serious source of error, especially for daily data where the variability is high and the spatial correlations are relatively low.
Thus I feel it is safer to analyze changes in extremes and weather variability on station data and avoid the additional problems of gridded datasets, especially at daily scales.
Using this dataset will in general be better than using raw data and it is great to have a global dataset. But please be careful and compare your results with those derived from carefully homogenized regional daily datasets. These methods are also still in their beginning stages, but if they can be applied, they should produce more reliable data.
Related posts
Statistically interesting problems: correction methods in homogenization
HUME: Homogenisation, Uncertainty Measures and Extreme weather
A database with daily climate data for more reliable studies of changes in extreme weather
Introduction to series on weather variability and extreme events (part 1)
On the importance of changes in weather variability for changes in extremes (part 2; important further posts in this series are unfortunately still missing)