The Task Team on Homogenization (TT-HOM) of the Open Panel of CCl Experts on Climate Monitoring and Assessment (OPACE-2) of the Commission on Climatology (CCl) of the World Meteorological Organization (WMO) has published their Guidance on the homogenisation of climate station data.
The guidance report was a bit longish, so at the end we decided that the last chapter on "Future research & collaboration needs" was best deleted. As chair of the task team and as someone who likes to dream about what others could do from a comfy chair, I wrote most of this chapter, so we decided to simply turn it into a blog post for this blog. Enjoy.
Introduction
This guidance is based on our current best understanding of
inhomogeneities and homogenisation. However, writing it also made clear that there is a need
for a better understanding of the problems.
A better mathematical understanding of statistical homogenisation is
important because that is what most of our work is based on. A
stronger mathematical basis is a prerequisite for future
methodological improvements.
A stronger focus on a (physical) understanding of inhomogeneities
would complement and strengthen the statistical work. This kind of
work is often performed at the station or network level, but also
needed at larger spatial scales. Much of this work is performed using
parallel measurements, but they are typically not internationally
shared.
In an observational science the strength of the outcomes depends on a
consilience of evidence. Thus having evidence on inhomogeneities from
both statistical homogenisation and physical studies strengthens the
science.
This chapter discusses the needs for future research on
homogenisation, grouped into five kinds of problems. The first
section covers research on improving our physical understanding and
on physics-based corrections. The next section is about break detection, especially about two fundamental problems
in statistical homogenisation: the inhomogeneous-reference problem
and the multiple-breakpoint problem.
After that we turn to computing uncertainties in trends and
long-term variability estimates from homogenised data due to
remaining inhomogeneities. The following section discusses whether
inhomogeneities are stochastic or deterministic and how that may
affect homogenisation, especially correction methods for the
variability around the long-term mean. The last section asks whether
correction methods can be improved by treating correction as a
statistical model selection problem.
For all the research ideas mentioned below, it is understood that in
the future we should study more meteorological variables than just
temperature. In addition, more studies on inhomogeneities across
variables could help us understand the causes of
inhomogeneities and increase the signal-to-noise ratio.
Homogenisation by national offices has advantages because there all
climate elements from one station are stored together. This helps in
understanding and identifying breaks. It would help homogenisation
science and climate analysis to have a global database for all
climate elements, like ICOADS for marine data. A Copernicus project
has started working on this for land station data, which is an
encouraging development.
Physical understanding
It is a good scientific practice to perform parallel measurements in
order to manage unavoidable changes and to compare the results of
statistical homogenisation to the expectations given the cause of the
inhomogeneity according to the metadata. This information should also
be analysed on continental and global scales to get a better
understanding of when historical transitions took place and to guide
homogenisation of large-scale (global) datasets. This requires more
international sharing of parallel data and standards on the reporting
of the size of breaks confirmed by metadata.
The Dutch weather service KNMI published a protocol on how to manage
possible future changes to the network: who decides what needs to be
done in which situation, what kind of studies should be made, where
the studies should be published, and that the parallel data should be
stored in their central database as experimental data. A translation
of this report will soon be published by the WMO (Brandsma et al.,
2019) and will hopefully inspire other weather services to formalise
their network change management.
Next to statistical homogenisation, making and studying parallel
measurements, and other physical estimates, can provide a second line
of evidence on the magnitude of inhomogeneities. Having multiple
lines of evidence provides robustness to observational sciences.
Parallel data is especially important for the large historical
transitions that are most likely to produce biases in network-wide to
global climate datasets. It can validate the results of statistical
homogenisation and be used to estimate possibly needed additional
adjustments. The Parallel Observations Science Team of the
International Surface Temperature Initiative (ISTI-POST) is working
on building such a global dataset with parallel measurements.
Parallel data is especially suited to improving our physical understanding
of the causes of inhomogeneities by studying how the magnitude of the
inhomogeneity depends on the weather and on instrumental design
characteristics. This understanding is important for more accurate
corrections of the distribution, for realistic benchmarking datasets
to test our homogenisation methods, and for determining which additional
parallel experiments are especially useful.
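To make this concrete, here is a minimal sketch in Python of this kind of analysis. The parallel data are entirely synthetic stand-ins (the radiation-error model, coefficients and units are invented for illustration); the point is only that the paired difference can be regressed on candidate physical drivers such as solar radiation and wind speed.

```python
import numpy as np

# Synthetic stand-in for a parallel experiment: daily Tmax from an old and a new
# set-up plus co-located solar radiation and wind speed (all values invented).
rng = np.random.default_rng(0)
n = 365
radiation = rng.gamma(shape=2.0, scale=100.0, size=n)        # W/m^2
wind = rng.gamma(shape=2.0, scale=1.5, size=n)               # m/s
tmax_new = 15 + 10 * np.sin(2 * np.pi * np.arange(n) / n) + rng.normal(0, 2, n)
# Assumed radiation error of the old screen: grows with radiation, shrinks with wind.
tmax_old = tmax_new + 0.004 * radiation / (1 + 0.3 * wind) + rng.normal(0, 0.2, n)

# Regress the paired difference on candidate physical drivers.
diff = tmax_old - tmax_new
X = np.column_stack([np.ones(n), radiation, wind])
coef, *_ = np.linalg.lstsq(X, diff, rcond=None)
print("intercept, radiation and wind coefficients:", np.round(coef, 4))
```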
Detailed physical models of the measurement, for example, the flow
through the screens, radiative transfer and heat flows, can also help
gain a better understanding of the measurement and its error sources.
This aids in understanding historical instruments and in designing
better future instruments. Physical models will also be paramount for
understanding the influence of the surroundings on the measurement,
from nearby obstacles and surfaces that affect error sources and air
flow, to changes in the measurand itself, such as urbanisation,
deforestation or the introduction of irrigation.
Land-use changes, especially urbanisation, should be studied together
with the relocations they may provoke.
Break detection
Longer climate series typically contain more than one break. This
so-called multiple-breakpoint problem is currently an important
research topic. A complication of relative homogenisation is that
the reference stations can also have inhomogeneities. This so-called
inhomogeneous-reference problem has not been optimally solved yet. It is
also not clear what temporal resolution is best for detection and
what the optimal way is to handle the seasonal cycle in the
statistical properties of climate data and of many inhomogeneities.
For temperature time series about one break per 15 to 20 years is
typical and multiple breaks are thus common. Unfortunately, most
statistical detection methods have been developed for one break and
for the null hypothesis of white (sometimes red) noise. In the case of
multiple breaks, the statistical test should take into account not only
the noise variance, but also the break variance from breaks at
other positions. For low signal-to-noise ratios, the additional break
variance can lead to spurious detections and inaccuracies in the
break position (Lindau and Venema, 2018a).
To apply single-breakpoint tests on series with multiple breaks, one
ad-hoc solution is to first split the series at the most significant
break (for example, with the standard normal homogeneity test, SNHT)
and investigate the subseries. Such a greedy algorithm does not
always find the optimal solution. Another solution is to detect
breaks on short windows. The window should be short enough to contain
only one break, which reduces the power of detection considerably. This
method is not used much nowadays.
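For illustration, a minimal sketch of such a greedy splitting scheme is given below. The SNHT statistic follows the usual two-sample form, but the detection threshold and the minimum segment length are arbitrary illustrative values, not the published critical values of any operational method.

```python
import numpy as np

def snht_statistic(x):
    """Maximum SNHT statistic and the corresponding candidate break position."""
    z = (x - x.mean()) / x.std(ddof=1)
    n = len(z)
    stats = [(k * z[:k].mean() ** 2 + (n - k) * z[k:].mean() ** 2, k)
             for k in range(1, n)]
    return max(stats)

def greedy_split(x, start=0, threshold=8.0, min_len=10, breaks=None):
    """Recursively split at the most significant break; the threshold and the
    minimum segment length are illustrative, not published critical values."""
    if breaks is None:
        breaks = []
    if len(x) < 2 * min_len:
        return breaks
    t, k = snht_statistic(x)
    if t > threshold:
        breaks.append(start + k)
        greedy_split(x[:k], start, threshold, min_len, breaks)
        greedy_split(x[k:], start + k, threshold, min_len, breaks)
    return sorted(breaks)

# Toy difference series with breaks after years 40 and 80.
rng = np.random.default_rng(1)
series = np.concatenate([rng.normal(0.0, 1, 40),
                         rng.normal(1.2, 1, 40),
                         rng.normal(0.2, 1, 40)])
print(greedy_split(series))
```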
Multiple-breakpoint methods can find an optimal solution and are
nowadays numerically feasible. This can be done in a hypothesis-testing
framework (MASH) or in a statistical model selection framework. For a
given number of breaks these methods find the break combination
that minimizes the internal variance, that is the variance within the
homogeneous subperiods (equivalently, the combination that maximizes
the variance explained by the breaks). To find the
optimal number of breaks, a penalty is added that increases with the
number of breaks. Examples of such methods are PRODIGE (Caussinus &
Mestre, 2004) or ACMANT (based on PRODIGE; Domonkos, 2011b). In a
similar line of research, Lu et al. (2010) solved the multiple-breakpoint
problem using a minimum description length (MDL) based
information criterion as penalty function.
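The segmentation idea can be sketched with a simple optimal-partitioning dynamic programme. This is not PRODIGE or ACMANT themselves: it minimises the variance within homogeneous subperiods plus a fixed penalty per break, a penalty form whose limitations are discussed in the next paragraph.

```python
import numpy as np

def optimal_segmentation(x, penalty):
    """Optimal partitioning: minimise the internal variance (sum of squared
    deviations within homogeneous subperiods) plus `penalty` per break.
    A fixed per-break penalty is used purely for illustration."""
    n = len(x)
    csum = np.cumsum(np.r_[0.0, x])
    csum2 = np.cumsum(np.r_[0.0, x ** 2])

    def seg_cost(i, j):                        # squared deviations of x[i:j] from its mean
        s, s2, m = csum[j] - csum[i], csum2[j] - csum2[i], j - i
        return s2 - s * s / m

    best = np.full(n + 1, np.inf)              # best[j]: cost of optimally segmenting x[:j]
    best[0] = -penalty                         # the first segment carries no penalty
    prev = np.zeros(n + 1, dtype=int)
    for j in range(1, n + 1):
        for i in range(j):
            cost = best[i] + seg_cost(i, j) + penalty
            if cost < best[j]:
                best[j], prev[j] = cost, i
    breaks, j = [], n                          # backtrack the break positions
    while j > 0:
        if prev[j] > 0:
            breaks.append(prev[j])
        j = prev[j]
    return sorted(breaks)

# Toy example: two inserted breaks in a noisy difference series.
rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(0, 1, 50), rng.normal(1.5, 1, 50), rng.normal(0.5, 1, 50)])
print(optimal_segmentation(x, penalty=12.0))
```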
The penalty function of PRODIGE was found to be suboptimal (Lindau
and Venema, 2013): the penalty should be a function of the number of
breaks rather than fixed per break, and the relation with the length
of the series should be reversed. It is not clear yet how sensitively
homogenisation methods respond to this, but increasing the penalty per
break in the case of a low signal-to-noise ratio, so as to reduce the
number of breaks, does not make the estimated break signal more
accurate (Lindau and Venema, 2018a).
Not only the candidate station but also the reference stations will have
inhomogeneities, which complicates homogenisation. Such
inhomogeneities can be climatologically especially important when
they are due to network-wide technological transitions. An example of
such a transition is the current replacement of temperature
observations using Stevenson screens by automatic weather stations.
Such transitions are important periods as they may cause biases in
the network and global average trends and they produce many breaks
over a short period.
A related problem is that sometimes all stations in a network have a
break at the same date, for example, when a weather service changes
the time of observation. Nationally such breaks are corrected using
metadata. If this change is unknown in global datasets one can still
detect and correct such inhomogeneities statistically by comparison
with other nearby networks. That would require an algorithm that
additionally knows which stations belong to which network and
prioritizes correcting breaks found between stations in different
networks. Such algorithms do not exist yet and information on which
station belongs to which network for which period is typically not
internationally shared.
The influence of inhomogeneities in the reference can be reduced by
computing composite references over many stations, removing reference
stations with breaks and by performing homogenisation iteratively.
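As a minimal sketch, a composite reference and the resulting difference series could be computed as follows; the squared-correlation weights are just one plausible choice, not a recommendation.

```python
import numpy as np

def candidate_minus_reference(candidate, neighbours):
    """Difference series against a composite reference built as a
    correlation-weighted mean of neighbour anomaly series (squared
    correlations are just one plausible weighting choice)."""
    weights = np.array([np.corrcoef(candidate, nb)[0, 1] ** 2 for nb in neighbours])
    reference = np.average(np.asarray(neighbours), axis=0, weights=weights)
    return candidate - reference
```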
A direct approach to solving this problem would be to simultaneously
homogenise multiple stations, also called joint detection. A step in
this direction is pairwise homogenisation, where breaks are
detected in the pairs of stations. This requires an additional attribution step,
which attributes the breaks to a specific station. Currently this is
done by hand (for PRODIGE; Caussinus and Mestre, 2004; Rustemeier et
al., 2017) or with ad-hoc rules (by the Pairwise homogenisation
algorithm of NOAA; Menne and Williams, 2009).
In the homogenisation method HOMER (Mestre et al., 2013) a first
attempt is made to homogenise all pairs simultaneously using a joint
detection method from bio-statistics. Feedback from first users
suggests that this method should not be used automatically. It
should be studied how well this method works and where the problems
come from.
Multiple-breakpoint methods are more accurate than single-breakpoint
methods. This expected higher accuracy is founded on theory (Hawkins,
1972). In addition, the HOME benchmarking study found numerically
that modern homogenisation methods, which take the multiple-breakpoint
and the inhomogeneous-reference problems into account, are
about a factor of two more accurate than traditional methods (Venema et
al., 2012).
However, the current version of CLIMATOL applies single-breakpoint
detection tests (first SNHT detection on a window, then splitting) and
achieves results comparable to modern multiple-breakpoint methods with
respect to break detection and homogeneity of the data (Killick,
2016). This suggests that either the multiple-breakpoint detection
principle is not as important as previously thought, which warrants
deeper study, or the accuracy of CLIMATOL is partly due to an unknown
unknown.
The signal to noise ratio is paramount for the reliable detection of
breaks. It would thus be valuable to develop statistical methods that
explain part of the variance of a difference time series and remove
this to see the breaks more clearly. Data from (regional) reanalyses
could provide useful predictors for this.
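A hedged sketch of that idea: regress the difference series on external predictors (here simply passed in as an array, which in practice might come from a reanalysis) and search for breaks in the residual.

```python
import numpy as np

def remove_explained_variance(diff_series, predictors):
    """Regress a candidate-minus-reference series on external predictors
    (e.g. taken from a reanalysis) and return the residual series, in which
    breaks should stand out more clearly."""
    X = np.column_stack([np.ones(len(diff_series)), predictors])
    coef, *_ = np.linalg.lstsq(X, diff_series, rcond=None)
    return diff_series - X @ coef
```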
First methods have been published to detect breaks for daily data
(Toreti et al., 2012; Rienzner and Gandolfi, 2013). It has not been
studied yet what the optimal resolution for break detection is
(daily, monthly, annual), nor what the optimal way is to handle the
seasonal cycle in the climate data and exploit the seasonal cycle of
inhomogeneities. In the daily temperature benchmarking study of
Killick (2016) most non-specialised detection methods performed
better than the daily detection method MAC-D (Rienzner and Gandolfi,
2013).
The selection of appropriate reference stations is a necessary step
for accurate detection and correction. Many different methods and
metrics are used for the station selection, but studies on the
optimal method are missing. The knowledge of local climatologists
about which stations share a similar regional climate needs to be made
objective so that it can be applied automatically (at larger scales).
For detection a high signal to noise ratio is most important, while
for correction it is paramount that all stations are in the same
climatic region. Typically the same networks are used for both
detection and correction, but it should be investigated whether a
smaller network for correction would be beneficial. Also in general,
we need more research on understanding the performance of (monthly
and daily) correction methods.
Computing uncertainties
Also after homogenisation, uncertainties remain in the data due to
various problems:
- Not all breaks in the candidate station have been or can be detected.
- False alarms are an unavoidable trade-off for detecting many real breaks.
- There is uncertainty in the estimation of correction parameters due to limited data.
- There are uncertainties in the corrections due to limited information on the break positions.
From validation and benchmarking studies we have a reasonable idea
about the remaining uncertainties that one can expect in the
homogenised data, at least with respect to changes in the long-term
mean temperature. For many other variables and changes in the
distribution of (sub-)daily temperature data individual developers
have validated their methods, but systematic validation and
comparison studies are still missing.
Furthermore, such studies only provide a general uncertainty level,
whereas more detailed information for every single station/region and
period would be valuable. The uncertainties will strongly depend on
the signal-to-noise ratios, on the statistical properties of the
inhomogeneities of the raw data, and on the quality and
cross-correlations of the reference stations, all of which vary
strongly per station, region and period.
Communicating such a complicated error structure, which is mainly
temporal but also partially spatial, is a problem in itself.
Furthermore, not only the uncertainty in the means should be
considered, but, especially for daily data, uncertainties in the
complete probability density function need to be estimated and
communicated. One option is to provide an ensemble of possible
realisations, similar to Brohan et al. (2006).
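A minimal sketch of how such an ensemble could be generated, assuming that for every detected break we already have an adjustment and a (here hypothetical) standard error of that adjustment:

```python
import numpy as np

def adjustment_ensemble(raw, break_positions, adjustments, adjustment_errors,
                        n_members=100, seed=0):
    """Ensemble of homogenised realisations: each member re-applies every
    adjustment perturbed with its (assumed known) standard error. Adjustments
    are applied to the data before the corresponding break."""
    rng = np.random.default_rng(seed)
    members = np.empty((n_members, len(raw)))
    for m in range(n_members):
        series = np.asarray(raw, dtype=float).copy()
        for pos, adj, err in zip(break_positions, adjustments, adjustment_errors):
            series[:pos] += adj + rng.normal(0.0, err)
        members[m] = series
    return members
```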
An analytic understanding of the uncertainties is important, but is
often limited to idealised cases. Thus numerical validation
studies, such as the past HOME and the upcoming ISTI studies, are also
important for assessing homogenisation algorithms under
realistic conditions.
Creating validation datasets also helps to reveal the limits of our
understanding of the statistical properties of the break signal. This
is especially the case for variables other than temperature and for
daily and (sub-)daily data. Information is needed on the real break
frequencies and size distributions, but also on their auto-correlations
and cross-correlations, as well as, as explained in the next section,
on the stochastic nature of breaks in the variability around the mean.
Validation studies focussed on difficult cases would be valuable for
a better understanding. For example, sparse networks, isolated island
networks, large spatial trend gradients and strong decadal
variability in the difference series of nearby stations (for example,
due to El Nino in complex mountainous regions).
The advantage of simulated data is that one can create a large number
of quite realistic complete networks. For daily data it will remain
hard for the years to come to determine how to generate a realistic
validation dataset. Thus even if using parallel measurements is
mostly limited to one break per test, it does provide the highest
degree of realism for this one break.
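As an illustration of how such simulated networks are built, here is a toy generator; the break frequency, break-size distribution and noise levels are illustrative assumptions, not estimates derived from real data.

```python
import numpy as np

def simulate_network(n_stations=10, n_years=100, break_prob=1 / 20,
                     break_sd=0.8, noise_sd=0.6, seed=0):
    """Toy validation network: a shared climate signal plus station noise and
    randomly inserted break inhomogeneities. Break frequency and size
    distribution are illustrative choices, not estimates from real data."""
    rng = np.random.default_rng(seed)
    climate = np.cumsum(rng.normal(0.0, 0.1, n_years))        # common regional signal
    truth = np.empty((n_stations, n_years))
    inhomogeneous = np.empty((n_stations, n_years))
    for i in range(n_stations):
        clean = climate + rng.normal(0.0, noise_sd, n_years)
        shift = np.zeros(n_years)
        for year in np.where(rng.random(n_years) < break_prob)[0]:
            shift[year:] += rng.normal(0.0, break_sd)          # step change from this year on
        truth[i], inhomogeneous[i] = clean, clean + shift
    return truth, inhomogeneous
```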
Deterministic or stochastic corrections?
Annual and monthly data is normally used to study trends and
variability in the mean state of the atmosphere. Consequently,
typically only the mean is adjusted by homogenisation. Daily data, on
the other hand, is used to study climatic changes in weather
variability, severe weather and extremes. Consequently, not only the
mean should be corrected, but the full probability distribution
describing the variability of the weather.
The physics of the problem suggests that many inhomogeneities are
caused by stochastic processes. An example affecting many instruments
is differences in the response times of instruments, which can lead
to differences determined by turbulence. A fast thermometer will on
average read higher maximum temperatures than a slow one, but this
difference will be variable and sometimes much larger than the
average. In the case of errors due to insolation, the radiation error
will be modulated by clouds. An insufficiently shielded thermometer
will need larger corrections on warm days, which will typically be
sunnier, but some warm days will be cloudy and need little
correction, while other warm days are sunny and calm with a dry,
hot surface. The adjustment of daily data for studies on changes in
the variability is thus a distribution problem and not only a
regression bias-correction problem. For data assimilation (numerical
weather prediction) accurate bias correction (with regression
methods) is probably the main concern.
Seen as a variability problem, the correction of daily data is
similar to statistical downscaling in many ways. Both methodologies
aim to produce bias-corrected data with the right variability, taking
into account the local climate and large-scale circulation. One
lesson from statistical downscaling is that increasing the variance
of a time series deterministically by multiplication with a factor,
called inflation, is the wrong approach and that the variance that
could not be explained by regression using predictors should be added
stochastically as noise instead (Von Storch, 1999). Maraun (2013)
demonstrated that the inflation problem also exists for the
deterministic Quantile Matching method, which is also used in daily
homogenisation. Current statistical correction methods
deterministically change the daily temperature distribution and do
not stochastically add noise.
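A toy numerical illustration of the von Storch (1999) argument, not a homogenisation method: both inflation and noise addition restore the target variance, but inflation exaggerates the covariance with the predictor, that is, the predictable part of the signal.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
predictor = rng.normal(0, 1, n)
target = 0.6 * predictor + rng.normal(0, 0.8, n)   # only part of the variance is predictable

fit = 0.6 * predictor                              # regression part (assumed known here)
inflated = fit * target.std() / fit.std()          # deterministic inflation of the fit
stochastic = fit + rng.normal(0, np.sqrt(target.var() - fit.var()), n)  # add noise instead

for name, series in [("target", target), ("inflated", inflated), ("stochastic", stochastic)]:
    print(f"{name:10s} variance {series.var():.2f}  "
          f"covariance with predictor {np.cov(series, predictor)[0, 1]:.2f}")
```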
Transferring ideas from downscaling to daily homogenisation is likely
fruitful to develop such stochastic variability correction methods.
For example, predictor selection methods from downscaling could be
useful. Both fields require powerful and robust (time invariant)
predictors. Multi-site statistical downscaling techniques aim at
reproducing the auto- and cross-correlations between stations (Maraun
et al., 2010), which may be interesting for homogenisation as well.
The daily temperature benchmarking study of Rachel Killick (2016)
suggests that current daily correction methods are not able to
improve the distribution much. There is a pressing need for more
research on this topic. However, these methods likely also performed
less well because they were used together with detection methods with
a much lower hit rate than the comparison methods.
Whether the deterministic correction methods lead to severe errors in
homogenisation still needs to be studied, but stochastic methods
that implement the corrections by adding noise would, at least
theoretically, fit the problem better. Such stochastic corrections
are not trivial and should have the right variability on all temporal
and spatial scales.
It should be studied whether it may be better to only detect the
dates of break inhomogeneities and perform the analysis on the
homogeneous subperiods (HSPs), removing the need for corrections. The
disadvantage of this approach is that most of the trend variance is
in the differences between the means of the HSPs and only a small part
is in the trends within the HSPs. In the case of trend analysis, this would be
similar to the work of the Berkeley Earth Surface Temperature group
on the mean temperature signal. Periods with gradual inhomogeneities,
e.g., due to urbanisation, would have to be detected and excluded
from such an analysis.
An outstanding problem is that current variability correction methods
have only been developed for break inhomogeneities, methods for
gradual ones are still missing. In homogenisation of the mean of
annual and monthly data, gradual inhomogeneities are successfully
removed by implementing multiple small breaks in the same direction.
However, as daily data is used to study changes in the distribution,
this may not be appropriate for daily data as it could produce larger
deviations near the breaks. Furthermore, changing the variance in
data with a trend can be problematic (Von Storch, 1999).
At the moment most daily correction methods correct the breaks one
after another. In monthly homogenisation it is found that correcting
all breaks simultaneously (Caussinus and Mestre, 2004) is more
accurate (Domonkos et al., 2013). It is thus likely worthwhile to
develop multiple breakpoint correction methods for daily data as
well.
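To illustrate what joint estimation means, here is a compact least-squares sketch of an ANOVA-type model in the spirit of Caussinus and Mestre (2004), with all break positions assumed known; it is a simplification for annual means, not the published algorithm.

```python
import numpy as np

def joint_corrections(data, breaks_per_station):
    """Joint (ANOVA-type) estimation of break corrections:
    data[i, t] = climate[t] + level of the homogeneous subperiod containing t.
    Break positions are assumed given; corrections align every subperiod
    with the station's most recent one."""
    n_st, n_t = data.shape
    hsp_index = np.zeros((n_st, n_t), dtype=int)     # subperiod label per (station, year)
    hsp_station = []                                 # station owning each subperiod column
    col = 0
    for i, brks in enumerate(breaks_per_station):
        edges = [0, *sorted(brks), n_t]
        for k in range(len(edges) - 1):
            hsp_index[i, edges[k]:edges[k + 1]] = col
            hsp_station.append(i)
            col += 1

    # Design matrix: one column per year except the first (climate signal)
    # plus one column per homogeneous subperiod (station levels).
    X = np.zeros((n_st * n_t, (n_t - 1) + col))
    y = data.ravel()
    for i in range(n_st):
        for t in range(n_t):
            row = i * n_t + t
            if t > 0:
                X[row, t - 1] = 1.0
            X[row, (n_t - 1) + hsp_index[i, t]] = 1.0
    levels = np.linalg.lstsq(X, y, rcond=None)[0][(n_t - 1):]

    corrections = []
    for i in range(n_st):
        cols = [c for c, st in enumerate(hsp_station) if st == i]
        corrections.append(levels[cols[-1]] - levels[np.array(cols)])
    return corrections

# Toy usage: three stations, one with a +1 break after year 30.
rng = np.random.default_rng(4)
climate = np.cumsum(rng.normal(0, 0.1, 60))
data = climate + rng.normal(0, 0.3, (3, 60))
data[0, 30:] += 1.0
print(joint_corrections(data, [[30], [], []])[0])   # roughly [1., 0.]
```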
Finally, current daily correction methods rely on previously detected
breaks and assume that the homogeneous subperiods (HSPs), i.e., the
segments between breakpoints, are homogeneous. However, these HSPs are
currently based on the detection of breaks in the mean only. Breaks in
higher moments may thus still be present in the "homogeneous"
subperiods and affect the corrections. If only for this reason, we
should also work on the detection of breaks in the distribution.
Correction as model selection problem
The number of degrees of freedom (DOF) of the various correction
methods varies widely: from just one degree of freedom for annual
corrections of the mean, to 12 for monthly corrections of the mean,
to 40 for decile corrections applied to every season, to a large
number of DOF for quantile or percentile matching.
A study using PRODIGE on the HOME benchmark suggested that for
typical European networks monthly adjustments are best for
temperature; annual corrections are probably less accurate because
they fail to account for changes in seasonal cycle due to
inhomogeneities. For precipitation annual corrections were most
accurate; monthly corrections were likely less accurate because the
data was too noisy to estimate the 12 correction constants/degrees of
freedom.
Which correction method is best depends on the characteristics of
the inhomogeneity. For a calibration problem just the annual mean
could be sufficient; for a serious exposure problem (e.g., insolation
of the instrument) a seasonal cycle in the monthly corrections may be
expected and the full distribution of the daily temperatures may need
to be adjusted. The best correction method also depends on the
reference: whether the parameters of a certain correction model can be
reliably estimated depends on how well correlated the neighbouring
reference stations are.
An entire regional network is typically homogenised with the same
correction method, while the optimal correction method will depend on
the characteristics of each individual break and on the quality of
the reference. These will vary from station to station, from break to
break and from period to period. Work on correction methods that
objectively select the optimal correction method, e.g., using an
information criterion, would be valuable.
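As a sketch of such an objective selection, the following compares an annual and a monthly correction model for a single break using BIC; the criterion, the parameter counts and the assumption that every calendar month occurs on both sides of the break are simplifying choices made only for illustration.

```python
import numpy as np

def bic(residuals, n_params):
    """Gaussian BIC for a correction model with n_params fitted constants."""
    n = len(residuals)
    return n * np.log(np.mean(residuals ** 2)) + n_params * np.log(n)

def select_correction_model(diff, months, break_pos):
    """Compare an annual correction (one shift) with a monthly correction
    (twelve shifts) for a single break in a candidate-minus-reference series."""
    after = np.arange(len(diff)) >= break_pos

    # Annual model: common mean before the break plus one shift after it.
    base = diff[~after].mean()
    shift = diff[after].mean() - base
    resid_annual = diff - base - np.where(after, shift, 0.0)

    # Monthly model: a mean and a shift for every calendar month.
    resid_monthly = np.empty_like(diff, dtype=float)
    for m in range(12):
        sel = months == m
        base_m = diff[sel & ~after].mean()
        shift_m = diff[sel & after].mean() - base_m
        resid_monthly[sel] = diff[sel] - base_m - np.where(after[sel], shift_m, 0.0)

    scores = {"annual": bic(resid_annual, 2), "monthly": bic(resid_monthly, 24)}
    return min(scores, key=scores.get), scores

# Toy usage: a break whose size has a clear seasonal cycle.
rng = np.random.default_rng(5)
n_years = 30
months = np.tile(np.arange(12), n_years)
seasonal_shift = 1.0 + 1.0 * np.cos(2 * np.pi * np.arange(12) / 12)
diff = rng.normal(0, 0.4, 12 * n_years)
diff[12 * 15:] += seasonal_shift[months[12 * 15:]]
print(select_correction_model(diff, months, break_pos=12 * 15)[0])  # expected: "monthly"
```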
In the case of (sub-)daily data, the range of options to select from becomes
even larger. Daily data can be corrected just for inhomogeneities in the
mean (e.g., Vincent et al., 2002, where daily temperatures are
corrected by incorporating a linear interpolation scheme that
preserves the previously defined monthly corrections) or also for the
variability around the mean. In between are methods that adjust for
the distribution including the seasonal cycle, which dominates the
variability and is thus effectively similar to mean adjustments with
a seasonal cycle. Correction methods of intermediate complexity with
more than one, but less than 10 degrees of freedom would fill a gap
and allow for more flexibility in selecting the optimal correction
model.
When applying these methods (Della-Marta and Wanner, 2006; Wang et
al., 2010; Mestre et al., 2011; Trewin, 2013), the number of quantile
bins (categories) needs to be selected, as well as whether to use
physical weather-dependent predictors and the functional form in which
they are used (Auchmann and Brönnimann, 2012). Objective, optimal
methods for these selections would be valuable.
Related information
WMO Guidelines on Homogenization (English, French, Spanish)
WMO guidance report: Challenges in the Transition from Conventional to Automatic Meteorological Observing Networks for Long-term Climate Records