Tuesday, 12 November 2013

Highlights EUMETNET Data Management Workshop 2013

The Data Management Workshop (DMW) had four main themes: data rescue, homogenization, quality control and data products. Homogenization was clearly the most important topic, with about half of the presentations, and was also the main reason I was there. Please find below the highlights I expect to be most interesting. In retrospect, this post has quite a focus on organizational matters, mainly because these were most new to me.

The DMW differs from the Budapest homogenization workshops in that it focuses more on best practices at weather services, while Budapest focuses more on the science and the development of homogenization methods. One idea from the workshop is that it may be worthwhile to have a counterpart to the homogenization workshop in the field of quality control.

BREAKING NEWS: Tamas Szentimrey announced that the 8th Homogenization seminar will be organized together with the 3rd Interpolation seminar in Budapest on 12-16 May 2014.

UPDATE: The slides of many presentations can now be downloaded.

Data rescue and management

Peer Hechler (slides) of the World Climate Data and Monitoring Programme (WCDMP) of the World Meteorological Organization (WMO) gave an overview of the ongoing data rescue activities, especially in WMO Region VI (Europe and the Middle East). Important activities are MEDARE and the data rescue in the Western Balkans, Jordan and Palestine.

Digitising the climate data itself is "just" a lot of work; finding and digitising the metadata (station histories) is the big challenge. The WMO Guidelines on data rescue from 2004 (PDF file) are still valid. There are now multiple information sources on data rescue, and with many independent initiatives going on there is a need for coordination. The WMO initiative iDARE is setting up an international data rescue portal; the homepage www.climatol.eu/DARE is an example of where it should go.

Data management itself is also still a concern. More than half of the WMO members do not have a proper data management system, according to a recent survey of the WMO Commission for Climatology (doc file with results). The true numbers are likely worse: the data management of the non-responders is probably not better. Furthermore, the WMO is working towards a global climate data management framework, which should be based on existing databases and make them interoperable.

The shutdown of climate stations is a problem of much concern. Even climatologically valuable long climate series are endangered by budget cuts. The WMO is working towards official recognition of centennial climate stations, in the hope that this will improve their protection in case of budget cuts.

The WMO has written a proposal to update the climatological normals every decade, but it retains the 1961-1990 period for long-term climate change assessments and keeps that period unchanged as long as there are no scientific reasons to change it. The new 10-year updates are intended more for user products.

Ingeborg Auer (slides), chair of a climate support group in EUMETNET, also reminded us that the IPCC has written that we should study the natural variability of the undisturbed climate in detail. For this, EUMETNET will focus on really long instrumental series (centennial stations). As mountain stations are sparse and important for climatology, they will also be included in this initiative if their series are more than 50 years long.

Aryan van Engelen (slides) presented the initiatives to implement projects similar to the European Climate Assessment & Dataset (ECA&D) and the Expert Team on Climate Change Detection and Indices (ETCCDI) on other continents. There are initiatives for Southeast Asia (SACA&D: 6000 series, 4100 stations, 34% downloadable; this has been achieved in just a few years), West Africa (WACA&D), and Latin America (LACA&D).

Homogenization

Enric Aguilar (slides) presented his work on using parallel measurements to study the transition from conventional observations (manual, with a Stevenson screen) to automatic weather stations in Spain. This transition is an important inhomogeneity in many regions, will continue in the foreseeable future, and may bias our climate record. First results showed that 50% of the differences in the mean lie between -0.5 and 0.5 °C. Some, but not all, of the differences have a seasonal cycle; this makes it likely that the distribution around the mean will also be affected by this transition.

The beauty of this study is that an entire network, and not "just" one pair of instruments, is studied. This gives much better statistics, especially as the results vary strongly from location to location. We will try to extend this study to other national parallel networks (Peru, Germany, Romania, …). Many of the parallel datasets contained inhomogeneities, which can be found very easily thanks to the high cross-correlations and should be removed before studying the differences.
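
To give an idea of how simple the basic statistic is: the sketch below (Python, with made-up array names and toy data, not the actual analysis) computes the mean difference, conventional minus automatic, for every parallel pair in such a network and the fraction of pairs whose mean difference lies within ±0.5 °C.

```python
import numpy as np

def parallel_mean_differences(conventional, automatic):
    """Mean difference (conventional minus automatic) for each parallel pair.

    conventional, automatic: 2-D arrays (station, day) with the daily
    temperatures of the overlapping measurement period; NaN marks gaps.
    """
    daily_diff = conventional - automatic            # difference series per station
    mean_diff = np.nanmean(daily_diff, axis=1)       # one mean difference per station
    frac_small = np.mean(np.abs(mean_diff) <= 0.5)   # fraction within +/- 0.5 degC
    return mean_diff, frac_small

# Toy data: 3 parallel pairs, one year of daily values
rng = np.random.default_rng(42)
conv = 15.0 + 8.0 * rng.standard_normal((3, 365))
auto = conv - np.array([[0.2], [-0.6], [0.1]]) + 0.3 * rng.standard_normal((3, 365))
mean_diff, frac_small = parallel_mean_differences(conv, auto)
print(mean_diff)    # roughly [0.2, -0.6, 0.1]
print(frac_small)   # fraction of pairs with |mean difference| <= 0.5 degC
```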

With such a dataset, daily correction methods for the temperature distribution can also be validated very well. First results showed that the percentile matching (PM) of Blair Trewin clearly outperformed simple regression models. Still, about 15% of the stations became worse after applying PM, maybe due to poor detection of breaks in the neighboring reference stations or extremely large breaks. It should be noted that homogenization can only improve climate data on average and that sometimes correcting a false break is unavoidable, even for the best methods.
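
For readers unfamiliar with percentile matching: the idea is to correct not just the mean but the whole distribution of the daily values. The sketch below shows a bare-bones quantile mapping across a single break; Trewin's actual PM method is more sophisticated and estimates the percentile-dependent corrections with the help of reference stations.

```python
import numpy as np

def quantile_adjust(before, after, n_quantiles=20):
    """Adjust the daily values *before* a break so that their quantiles
    match those of the segment *after* the break.

    Bare-bones quantile mapping only; Trewin's PM method estimates the
    percentile-dependent corrections with the help of neighboring
    reference stations rather than from the candidate series alone.
    """
    probs = np.linspace(0.025, 0.975, n_quantiles)
    q_before = np.quantile(before, probs)
    q_after = np.quantile(after, probs)
    # Interpolate a correction for every daily value from the
    # quantile-by-quantile differences.
    corrections = np.interp(before, q_before, q_after - q_before)
    return before + corrections
```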

Peter Domonkos (slides) presented recent improvements to his homogenization method ACMANT. The method can now also handle daily data and corrects the daily values based on the monthly corrections of the means. For temperature, ACMANT detects inhomogeneities in the annual means and in the magnitude of the seasonal cycle. The new version does something similar for precipitation, using the annual precipitation sums and the difference between the rain and snow seasons, which could detect changes in undercatch, to which snow is more prone than rain. The user has to set the length of these two periods by hand.

Another interesting innovation is the use of a quasi-logarithmic function to transform the precipitation values to a Gaussian distribution. This function gives less weight to small precipitation values than the commonly used logarithmic transform. The idea behind this is that with the logarithmic transform the difference between 10 and 20 mm would be seen as just as important as the difference between 1 and 2 mm. The latter, however, is likely less important, if only because such low precipitation amounts are difficult to measure reliably.
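
Such a transform does not have to be complicated. The sketch below uses log(1 + x/c) as an illustration; this is my own choice of a function with the described property, not necessarily the exact function used in ACMANT.

```python
import numpy as np

def quasi_log(x, c=5.0):
    """Quasi-logarithmic transform: roughly linear below about c mm and
    logarithmic above it. The offset c (an arbitrary 5 mm here) damps the
    influence of small precipitation amounts compared to a pure logarithm."""
    return np.log(1.0 + x / c)

for low, high in [(1.0, 2.0), (10.0, 20.0)]:
    print(low, high,
          np.log(high) - np.log(low),        # pure log: both pairs differ by log(2) = 0.69
          quasi_log(high) - quasi_log(low))  # quasi-log: about 0.15 vs 0.51
```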

Peter Domonkos recently wrote a guest post on this blog noting that the homogenization methods recommended by HOME, which are considerably more accurate than traditional methods, have not yet been adopted much in recent papers in the scientific literature. This workshop gave another impression. Maybe we were just being impatient. In particular, many people have started using HOMER, the homogenization method that is based on the best algorithms of the HOME-recommended methods. (A validation of HOMER is still missing, however.)

Eirik Forland (slides) presented lovely work by the Norwegian Meteorological Institute. They are resurrecting weather stations by installing automatic weather stations at locations where stations used to be in historical times, especially at old hunting stations around Svalbard in the north of Norway. This is an interesting region because the climate variability is very large, not only now but also in the past; unfortunately, the number of stations is very limited. You can actually still find the remnants of some old weather stations. Some of the precipitation trends in the Arctic could also be due to warming, and thus more rain relative to snow and less underestimation due to undercatch.

Maybe we should also have a look at the interaction between QC and homogenization. Data quality can differ between decades; sometimes there are decade-long periods with bad data. Clara Oria showed an example of such a dataset in Peru, see the picture below, likely due to a bad thermograph.


This temperature time series for Anta-Cusco has quality problems for over one decade. Figure from the poster of Clara Oria and Stefanie Gubler on the "adaptation of the data quality control procedures of conventional stations at SENAMHI Peru within the project CLIMANDES".

In the discussion it was noted that we have worked a lot on temperature and somewhat less on precipitation, but that inhomogeneities in many other climate variables are neither studied much nor homogenized much. For adaptation in relation to health, for example, we need more parameters; an important one would be humidity. Pressure is also not homogenized much; it would be relatively easy, as spatial correlations are strong, and important, as pressure is the basis of many meteorological phenomena.

Another discussion point was what we learn about network design from quality control and homogenization. One lesson is, I would argue, that it is good to cluster stations. Three stations near each other lend themselves to good quality control and homogenization. However, it is customary to spread weather stations as evenly as possible (although I remember a paper, by Shaun Lovejoy if I am correct, that did show clustering of climate stations, fitting the fractal pattern of human habitation). Also in selecting stations for digitisation, people often aim at spatial spread, while stations with nearby neighbours are very valuable because of the higher quality their homogenized data can achieve. Disclosure: I am working with ZAMG (the Austrian MetOffice) on the homogenization of station observations of humidity.

Quality control

It is not at all my field, but Christian Sigg (slides) presented a beautiful statistical approach to quality control using machine learning methods. It uses a one-class support vector machine to learn the typical behavior of a station from its past behavior. Not only the typical distribution of a variable is taken into account, but also its relations with other variables. (No precipitation is expected under cloud-free conditions, for example.)

A problem here is the curse of dimensionality, as many different relations are possible and there is only limited data to learn the relationships from. Another problem is that data problems that occur regularly may be seen as normal behavior.
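
To illustrate the general approach (this is a toy sketch with invented data, not Sigg's implementation), a one-class SVM, for instance the one in scikit-learn, can be trained on a station's past multivariate observations and will then flag new observations that fall outside the learned typical behavior:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

# Hypothetical training data for one station: temperature (degC), relative
# humidity (%), cloud cover (octa) and precipitation (mm), assumed to be
# mostly correct past observations.
rng = np.random.default_rng(1)
n = 5000
temperature = 10.0 + 8.0 * rng.standard_normal(n)
cloud_cover = rng.integers(0, 9, n).astype(float)
humidity = np.clip(50.0 + 4.0 * cloud_cover + 10.0 * rng.standard_normal(n), 5.0, 100.0)
precipitation = np.where(cloud_cover >= 6, rng.exponential(2.0, n), 0.0)
X_train = np.column_stack([temperature, humidity, cloud_cover, precipitation])

# Learn the typical multivariate behavior of the station.
scaler = StandardScaler().fit(X_train)
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.01)
model.fit(scaler.transform(X_train))

# A physically implausible new observation: heavy rain under a clear sky.
suspect = np.array([[12.0, 40.0, 0.0, 15.0]])
print(model.predict(scaler.transform(suspect)))  # -1 means flagged as atypical
```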

I like this method, if only because the statistical approach makes quality control more objective and reproducible. Quality control is now often based on thresholds, which set flags that are later checked by climatologists looking at the weather situation on that day and possibly contacting the observer who made the observation. This manual approach does provide more protection against true extreme values being marked as outliers. Apart from that, the statistical approach promises a higher accuracy than somewhat arbitrary rules and thresholds. Furthermore, it is more widely applicable (for a temperature reading, 40 degrees could be a good upper threshold for Norway, but would be too low for Greece) and would thus be well suited for a global dataset, for example the one of the International Surface Temperature Initiative (ISTI).
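
For comparison, a classical gross-error check boils down to something like the sketch below (with made-up limits); the point of the example is that the limits are location dependent, which is exactly what the statistical approach learns from the data.

```python
def gross_error_check(temperature_c, lower=-60.0, upper=40.0):
    """Flag a reading outside fixed plausibility limits (limits made up here).

    40 degC may be a reasonable upper limit for Norway, but it would wrongly
    flag valid heat waves in Greece, so every region, ideally every station,
    needs its own limits.
    """
    return not (lower <= temperature_c <= upper)

print(gross_error_check(43.5))              # True: flagged with the Norwegian-style limit
print(gross_error_check(43.5, upper=48.0))  # False: accepted with a higher, Greek-style limit
```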

Related information

At Data rescue at home, volunteers and weather enthusiasts can digitise historical weather data from all over the globe. In return, the data will be made available to the public without any restriction.

The MEditerranean climate DAta REscue (MEDARE) is an initiative, born under the auspices of the World Meteorological Organization, with the main objective of developing, consolidating and progressing climate data and metadata rescue activities across the Greater Mediterranean Region (GMR).

The slides of my presentation at the DMW2013, parallel measurements to study inhomogeneities in daily data, can be downloaded here.
