Sunday, 8 January 2012

What distinguishes a benchmark?

Science has many terms for studying the validity or performance of scientific methods: testing, validation, intercomparison, verification, evaluation, and benchmarking. Every term has a different, sometimes subtly different, meaning. Initially I had wanted to compare all these terms with each other, but that would have become a very long post, especially as the meaning of every term is different in business, engineering, computation and science. Therefore, this post will only propose a definition for benchmarking in science and describe what distinguishes it from other approaches, casually called other validation studies from now on.

In my view benchmarking has three distinguishing features.
1. The methods are tested blind.
2. The problem is realistic.
3. Benchmarking is a community effort.
The term benchmark has become fashionable lately. It is also used, however, for validation studies that do not display these three features. This is not wrong, as there is no generally accepted definition of benchmarking. In fact, an important article on benchmarking by Sim et al. (2003) defines "a benchmark as a test or set of tests used to compare the performance of alternative tools or techniques", which would include any validation study. They then limit the topic of their article, however, to interesting benchmarks, which are "created and used by a technical research community." If benchmarking is used for any type of validation study, the word has no added value. Thus I hope this post can be a starting point for a generally accepted, more restrictive definition.


Hopefully this post will be useful for all natural sciences, but my view may be biased by my participation in the COST Action HOME, which benchmarked homogenization algorithms for monthly surface temperature and precipitation data, and by my membership in the Benchmarking & Assessment Working Group of the International Surface Temperature Initiative (ISTI), whose focus will be a global temperature benchmark for homogenization algorithms. HOME has recently ended and just published its main paper: "Benchmarking homogenization algorithms for monthly data". The ISTI has just started and describes its benchmarking in a white paper: "Benchmarking homogenisation algorithm performance against test cases". Both the ISTI white paper and the HOME article are strongly influenced by a beautiful paper by Sim, Easterbrook and Holt (2003) on benchmarking in software engineering, which I recommend to everyone interested in benchmarking. Disclosure: Steven Easterbrook, the second author of this paper, is also in the Benchmarking Working Group.

The methods are tested blind on a realistic problem

Wikipedia also starts with a rather general definition of benchmarking that would include any validation study:
In computing, a benchmark is the act of running a computer program, a set of programs, or other operations, in order to assess the relative performance of an object, normally by running a number of standard tests and trials against it.
However, the article then continues to state that benchmarks are needed because specifications of hardware and software are not trivially comparable. Often trade-offs need to be made, e.g. in computing between CPU cycles and memory requirements, or in homogenization between the hit rate (correct detection of a jump in the data) and the false alarm rate. This leads directly to the need for a realistic, real-life problem to be solved. Wikipedia then continues to state that one of the challenges is that vendors start tuning their products to do well on benchmarks. This again highlights the importance of realism as well as blind testing.
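The trade-off between hit rate and false alarm rate can be made concrete with a minimal sketch. Everything in it is an assumption made for illustration: the function name `detection_scores`, the matching tolerance, the example break positions, and in particular the definition of the false alarm rate (false alarms divided by the number of time steps without a true break). It is not the scoring used in HOME or by the ISTI.

```python
def detection_scores(true_breaks, detected_breaks, n_steps, tolerance=0):
    """Return (hit_rate, false_alarm_rate) for one time series.

    Assumed definitions (illustrative only):
    - hit: a detected break within `tolerance` time steps of a true break
      that has not been matched yet;
    - hit rate: hits / number of true breaks;
    - false alarm rate: unmatched detections / time steps without a true break.
    """
    true_set = set(true_breaks)
    matched_true = set()
    false_alarms = 0
    for detected in detected_breaks:
        # True breaks that are still unmatched and close enough to this detection.
        candidates = [t for t in true_set - matched_true
                      if abs(t - detected) <= tolerance]
        if candidates:
            matched_true.add(min(candidates, key=lambda t: abs(t - detected)))
        else:
            false_alarms += 1
    hit_rate = len(matched_true) / len(true_set) if true_set else float("nan")
    n_no_break = n_steps - len(true_set)
    false_alarm_rate = false_alarms / n_no_break if n_no_break > 0 else float("nan")
    return hit_rate, false_alarm_rate


# Hypothetical example: 600 months with true jumps at months 120 and 300.
true_breaks = [120, 300]
sensitive = [118, 200, 302, 450]   # finds both jumps, but raises two false alarms
cautious = [119]                   # no false alarms, but misses one jump

print(detection_scores(true_breaks, sensitive, n_steps=600, tolerance=6))
print(detection_scores(true_breaks, cautious, n_steps=600, tolerance=6))
```

Neither hypothetical algorithm is better on both scores, which is why the two numbers only become comparable on a realistic problem where the relative cost of a missed jump and a false alarm is known.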

Both criteria are somewhat situation dependent. If the benchmark is sufficiently realistic, there would not be much opportunity for tuning. Consequently blind testing becomes less important. If trade-offs are not important, even realism may not be necessary. The I3RC project of the atmospheric 3-dimensional radiative transfer community has an academic case study with a 1-dimensional cloud, next to the realistic 3D clouds for which these codes are developed. Interestingly, this project started as an intercomparison project, as the truth is not known; nowadays the average of the contributions is seen as the benchmark solution for new codes.
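The idea of using the average of the contributions as a reference for new codes can be sketched as follows. This is a toy illustration under assumptions: the function names, the root-mean-square relative deviation as the comparison measure, and all numbers are hypothetical and are not taken from I3RC.

```python
from statistics import mean


def consensus_reference(contributions):
    """Point-wise mean over the contributed solutions (one list per code)."""
    return [mean(values) for values in zip(*contributions)]


def rms_relative_deviation(candidate, reference):
    """Root-mean-square of the relative deviations from the reference."""
    rel = [(c - r) / r for c, r in zip(candidate, reference)]
    return (sum(x * x for x in rel) / len(rel)) ** 0.5


# Hypothetical outputs of three established codes at four points.
contributions = [
    [0.40, 0.55, 0.62, 0.71],
    [0.42, 0.53, 0.60, 0.70],
    [0.41, 0.54, 0.61, 0.72],
]
reference = consensus_reference(contributions)

# Hypothetical output of a new code, compared against the consensus.
new_code = [0.43, 0.54, 0.59, 0.71]
print(rms_relative_deviation(new_code, reference))
```

Such a consensus is only as good as the contributions it is built from: a shared bias in the established codes would go unnoticed.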

Benchmarking is a community effort

The paper by Sim et al. (2003) is mainly about the community aspects of benchmarking and how defining and working on a benchmark can help a community to mature, to make progress:
The critical insight of our theory of benchmarking is this: Within research communities, benchmarks operationalize scientific paradigms, that is, they are a statement of the discipline's research goals and they emerge through a synergistic process of technical knowledge and social consensus proceeding in tandem.
From this insight it is clear that the benefit of benchmarking is greatest when the community is involved as much as possible. A COST Action mainly provides money for meetings with talks, debate and dinners and thus perfectly facilitates community building. In this respect COST Actions are ideally suited to work on benchmarking problems. The HOME benchmark dataset is public to allow new people to enter the community and benchmark new algorithms. Such tests would unfortunately not be blind anymore, but would still be valuable as the benchmark dataset is quite realistic.

The Benchmarking Working Group of the ISTI also makes a large effort to involve the entire community, e.g. with a blog where the properties of the benchmark are discussed. The Working Group also plans to write an article on the generation of the global temperature benchmark to facilitate public scrutiny.

Methodological diversity

One advantage of benchmarking is that defining the benchmark makes the problem to be solved clearer. As Sim et al. (2003) put it: "A benchmark operationalises a paradigm; it takes an abstract concept and makes it concrete". Especially for a field like climatology, which is under intense public scrutiny, a blind benchmark provides certainty that the results are an honest appraisal of the true power of the algorithms. Also very important is the intense collaboration on one problem. To cite Sim et al. (2003) again:
These factors together, collaboration, openness, and publicness, result in frank, detailed, and technical communication among researchers. This kind of public evaluation contrasts sharply with the descriptions of tools and techniques that are currently found in software engineering conference or journal publications. A well written paper is expected to show that the work is a novel and worthy contribution to the field, rather than share advice about how to tackle similar practical problems. Benchmarks are one of the few ways that the dirty details of research, such as debugging techniques, design decisions, and mistakes, are forced out into the open and shared between laboratories.
However, one should also not be blind to the disadvantages of benchmarking. It is therefore best to use a diversity of methodological approaches that includes various forms of validation studies as well as benchmarking.

One disadvantage is that the benchmark data needs to be realistic. For the understanding of the algorithms, or components thereof, simplified cases can be helpful. Another disadvantage is that you may find during the analysis that a contribution has some problems, which may just be a stupid programming error in converting the data to the requested format. If such errors are found after the deadline, they cannot be corrected. Thus benchmarking results may be suboptimal. Related to this is that if the solution is known, the dataset can be used to test and optimize the algorithms. If a benchmark is not realistic, optimizing algorithms for it may move research in the wrong direction (Sim et al., 2003). There may be some noise in the benchmark results due to differences in experience and effort of the participants. In a validation study all algorithms could be operated by one person, which may reduce differences in experience and especially effort. Finally, benchmarking, as a community effort that requires a realistic problem, needs a problem that is reasonably well understood. Without knowing the basics, discussions on details are not productive.

For these reasons, the Benchmarking Working Group of the ISTI plans to generate a number of worlds (datasets) with realistic settings and some with more artificial settings. The realistic datasets will be used for blind benchmarking; the solutions will be kept secret. For some of the more artificial datasets the solution will also be provided; they can be used to play around and interactively study the performance of the algorithms.
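To illustrate the separation between the data that participants receive and the solution that can be kept secret, here is a deliberately artificial sketch. It is not how the ISTI worlds are generated: the function `make_world`, its parameters, the white-noise "climate" and the Gaussian break sizes are all assumptions chosen for brevity.

```python
import random


def make_world(n_months, n_breaks, break_sd, noise_sd, seed):
    """Return (inhomogeneous_series, solution) for one toy station.

    The series is what a participant would receive. The solution (break
    positions, break sizes and the homogeneous series) can be withheld
    for a blind test or published for interactive experiments.
    """
    rng = random.Random(seed)
    break_positions = sorted(rng.sample(range(1, n_months), n_breaks))
    break_sizes = [rng.gauss(0.0, break_sd) for _ in break_positions]
    homogeneous = [rng.gauss(0.0, noise_sd) for _ in range(n_months)]

    inhomogeneous = []
    offset = 0.0
    next_break = 0
    for month in range(n_months):
        # From each break position onwards the series is shifted by the break size.
        if next_break < n_breaks and month == break_positions[next_break]:
            offset += break_sizes[next_break]
            next_break += 1
        inhomogeneous.append(homogeneous[month] + offset)

    solution = {
        "break_positions": break_positions,
        "break_sizes": break_sizes,
        "homogeneous": homogeneous,
    }
    return inhomogeneous, solution


# Hypothetical settings: 10 years of monthly data with 3 breaks.
series, solution = make_world(n_months=120, n_breaks=3,
                              break_sd=0.8, noise_sd=0.5, seed=1)
print(solution["break_positions"])
```

For a blind test only `series` would be distributed; for the artificial worlds with a public solution, both return values would be.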

Concluding remarks

I would propose to reserve the term benchmarking for community efforts to study the power of scientific methods. Depending on the situation, posing a realistic problem and testing the methods blind are important ingredients of benchmarking. Or maybe one could also state that "realism" and "blind testing" increase the interest of the community and thus help a dataset to become a benchmark.

More information on homogenisation of climate data

New article: Benchmarking homogenization algorithms for monthly data
Raw climate records contain changes due to non-climatic factors, such as relocations of stations or changes in instrumentation. This post introduces an article that tested how well such non-climatic factors can be removed.
Homogenization of monthly and annual data from surface stations
A short description of the causes of inhomogeneities in climate data (non-climatic variability) and how to remove it using the relative homogenization approach.
HUME: Homogenisation, Uncertainty Measures and Extreme weather
Proposal for future research in homogenisation of climate network data.
Statistical homogenisation for dummies
A primer on statistical homogenisation with many pictures.
Investigation of methods for hydroclimatic data homogenization
An example of the daily misinformation spread by the blog Watts Up With That? In this case about homogenization.


Sim, S. E., Easterbrook, S., and Holt, R. C.: Using Benchmarking to Advance Research: A Challenge to Software Engineering. Proceedings of the 25th International Conference on Software Engineering ICSE ’03, IEEE Computer Society Washington, DC, USA, ISBN: 0-7695-1877-X, 74-83, 2003.
Venema, V., O. Mestre, E. Aguilar, I. Auer, J.A. Guijarro, P. Domonkos, G. Vertacnik, T. Szentimrey, P. Stepanek, P. Zahradnicek, J. Viarre, G. Müller-Westermeier, M. Lakatos, C.N. Williams, M. Menne, R. Lindau, D. Rasol, E. Rustemeier, K. Kolokythas, T. Marinova, L. Andresen, F. Acquaotta, S. Fratianni, S. Cheval, M. Klancar, M. Brunetti, Ch. Gruber, M. Prohom Duran, T. Likso, P. Esteban, Th. Brandsma. Benchmarking monthly homogenization algorithms. Accepted by Climate of the Past, 2011.
Willett, K., M. Menne, P. Thorne, S. Brönnimann, I. Jolliffe, L. Vincent, and X. Wang. Benchmarking homogenisation algorithm performance against test cases. White paper presented at the workshop: Creating surface temperature datasets to meet 21st Century challenges, Met Office Hadley Centre, Exeter, UK, 7-9 September 2010.


Peter Domonkos said...

I would say rather:

"Benchmarking is a validation study in which the problem is important for a scientific community and the test dataset is accepted by that community as a proper access to achieve substantial improvement in solving the problem" - and would not stress the word "blind", because its inclusion in the definition means an incorrect presumption of the researchers' honesty.

Victor Venema said...

That is a good point. Trust is very important in general and especially in science. Society cannot function without trust. The scientific community is built in a way that increases trust, for example by splitting up problems into smaller ones, which are well defined and whose solution can be reproduced. For this benchmarking study, it was important that the study was blind because of the lack of trust from the side of the self-proclaimed climate sceptics. Let’s hope that this post-normal period will soon revert to normal again.

My favourite Reith Lecture (a series of the BBC) is about trust, well worth hearing.

In industry, benchmarking of hardware and software is also often performed. By nature such participants are less interested in the truth and there is more money at stake. Thus for such benchmarks, blindness is more important, while still not essential if the benchmark is sufficiently realistic.