Tuesday, 13 September 2016

Publish or perish is illegal in Germany, for good reason


Had Albert Einstein died just after his wonder year 1905, he would only have had a few publications on special relativity, the equivalence of mass and energy, Brownian motion and the photoelectric effect to his name and would nowadays be seen as a mediocre researcher. He got the Nobel prize in 1921 "for his services to Theoretical Physics, and especially for his discovery of the law of the photoelectric effect", not for relativity, not for Brownian motion. This illustrates how hard it is to judge scientific work, even more than a decade afterwards, much less in advance.
Managing scientists is hard. It is nearly impossible to determine who will do a good job, who is doing a good job and even whether someone did a good job in the past. In the last decades science managers in most of the world have largely given up trying to assess how good a scientist is and instead assess how many articles they write and how high the prestige is of the journals the articles appear in.

Unsurprisingly, this has succeeded in increasing the number of articles scientists write. Especially in America scientists are acutely aware that they have to publish or perish.

Did this hurt scientific progress? It is unfortunately impossible to say how fast science is progressing and how fast it could progress. The work is, after all, about the stuff we do not understand yet. The big steps, such as evolution, electromagnetism and quantum mechanics, have become rare in the last decades. Maybe the low-hanging fruit is simply gone. Maybe it is also modern publish-or-perish management.

There are good reasons to expect publish-or-perish management to be detrimental.
1. The most basic reason: The time spent writing and reading the ever-increasing number of articles is not spent on doing research. (I hope no one is so naive as to think that the average scientist actually became several times more productive.)
2. Topics that quickly and predictably lead to publications are not the same topics that will bring science forward. I personally try to write a mix, because working only on the riskier science that you expect to be important is unfortunately too dangerous.
3. The stick and carrot type of management works for manual labor, but for creative open-ended work it is often found to be detrimental. For creative work mastery and purpose are the incentives.

German science has another tradition, trusting scientists more and focusing on quality. This is expressed in the safeguards for good scientific practice of the German Science Foundation (DFG). It explicitly forbids the use of quantitative assessments of articles.
Universities and research institutes shall always give originality and quality precedence before quantity in their criteria for performance evaluation. This applies to academic degrees, to career advancement, appointments and the allocation of resources. …

criteria that primarily measure quantity create incentives for mass production and are therefore likely to be inimical [harmful] to high quality science and scholarship. …

Quantitative criteria today are common in judging academic achievement at all levels. … This practice needs revision with the aim of returning to qualitative criteria. … For applications for academic appointments, a maximum number of publications should regularly be requested for the evaluation of scientific merit.
For a project proposal to the German Science Foundation this "maximum number" means that you are not allowed to list all your publications, but only the 6 best ones (for a typical project; for smaller projects even fewer).

[UPDATE. This limit has unfortunately now been increased to 10. They say the biologists are to blame.]

While reading the next paragraphs, please hear me screaming YES, YES, YES in your ear at an unbearable volume.
An adequate evaluation of the achievements of an individual or a small group, however, always requires qualitative criteria in the narrow sense: their publications must be read and critically compared to the relevant state of the art and to the contributions of other individuals and working groups.

This confrontation with the content of the science, which demands time and care, is the essential core of peer review for which there is no alternative. The superficial use of quantitative indicators will only serve to devalue or to obfuscate the peer review process.
I fully realize that actually reading someone’s publications is much more work than counting them and that top scientists spend a large part of their time reviewing. In my view that is a reason to reduce the number of reviews and trust scientists more. Hire people who have a burning desire to understand the world, so that you can trust them.

Sometimes this desire goes away when people get older. For the outside world this is most visible in some older participants of the climate “debate” who hardly produce new work trying to understand climate change, but use their technical skills and time to deceive the public. The most extreme example I know is a professor who was painting all day long, while his students gave his lectures. We should be able to get rid of such people, but there is no need for frequent assessments of people doing their job well.

You also see this German tradition in the research institutes of the Max Planck Society. The directors of these institutes are the best scientists of the world and they can do whatever they think will bring their science forward. Max Planck Director Bjorn Stevens describes this system in the fourth and best episode of the podcast Forecast. The part on his freedom and the importance of trust starts at minute 27, but it is best to listen to the whole inspiring podcast, about which I could easily write several blog posts.

Stevens started his scientific career in the USA, but talks about the German science tradition when he says:
I can think of no bigger waste of time than reviewing Chris Bretherton’s proposals. I mean, why would you want to do that? The guy has shown himself to have good idea, after good idea, after good idea. At some point you say: go doc, go! Here is your budget and let him go. This whole industry that develops to keep someone like Chris Bretherton on a leash makes no sense to me.
Compare scientists who set priorities within their own budgets with scientists who submit research proposals judged by others. If you have your own budget you will only support what you think is really important; if you do A, you cannot do B. Many project proposals are written to fit into a research program or because a colleague wants to collaborate, and apart from the time wasted on writing them, there are no downsides to asking for more funding. If you have your own budget, the person with the most expertise and with the most skin in the game decides. And yet it is project funding, where the deciders have no skin in the game, that is called competitive. It is Soviet-style planning; that it works at all shows the dedication and altruism of the scientists involved. Those are scientists you could simply trust.

I hope this post will inspire the scientific community to move towards more trust in scientists, increase the fraction of unleashed researchers and reduce the misdirected quantitative micro-management. Please find below the full text of the safeguards of the German Science Foundation on performance evaluation; above I had to skip many worthwhile parts.



Recommendation 6: Performance Evaluation

Universities and research institutes shall always give originality and quality precedence before quantity in their criteria for performance evaluation. This applies to academic degrees, to career advancement, appointments and the allocation of resources.

Commentary
For the individual scientist and scholar, the conditions of his or her work and its evaluation may facilitate or hinder observing good scientific practice. Conditions that favour dishonest conduct should be changed. For example, criteria that primarily measure quantity create incentives for mass production and are therefore likely to be inimical to high quality science and scholarship.

Quantitative criteria today are common in judging academic achievement at all levels. They usually serve as an informal or implicit standard, although cases of formal requirements of this type have also been reported. They apply in many different contexts: length of Bachelor, Master or PhD thesis, number of publications for the Habilitation (formal qualification for university professorships in German speaking countries), as criteria for career advancements, appointments, peer review of grant proposals, etc. This practice needs revision with the aim of returning to qualitative criteria. The revision should begin at the first degree level and include all stages of academic qualification. For applications for academic appointments, a maximum number of publications should regularly be requested for the evaluation of scientific merit.

Since publications are the most important “product” of research, it may have seemed logical, when comparing achievement, to measure productivity as the number of products, i.e. publications, per length of time. But this has led to abuses like the so-called salami publications, repeated publication of the same findings, and observance of the principle of the LPU (least publishable unit).

Moreover, since productivity measures yield little useful information unless refined by quality measures, the length of publication lists was soon complemented by additional criteria like the reputation of the journals in which publications appeared, quantified as their “impact factor” (see section 2.5).

However, clearly neither counting publications nor computing their cumulative impact factors are by themselves adequate forms of performance evaluation. On the contrary, they are far removed from the features that constitute the quality element of scientific achievement: its originality, its “level of innovation”, its contribution to the advancement of knowledge. Through the growing frequency of their use, they rather run the danger of becoming surrogates for quality judgements instead of helpful indicators.

Quantitative performance indicators have their use in comparing collective activity and output at a high level of aggregation (faculties, institutes, entire countries) in an overview, or for giving a salient impression of developments over time. For such purposes, bibliometry today supplies a variety of instruments. However, they require specific expertise in their application.

An adequate evaluation of the achievements of an individual or a small group, however, always requires qualitative criteria in the narrow sense: their publications must be read and critically compared to the relevant state of the art and to the contributions of other individuals and working groups.

This confrontation with the content of the science, which demands time and care, is the essential core of peer review for which there is no alternative. The superficial use of quantitative indicators will only serve to devalue or to obfuscate the peer review process.

The rules that follow from this for the practice of scientific work and for the supervision of young scientists and scholars are clear. They apply conversely to peer review and performance evaluation:
  • Even in fields where intensive competition requires rapid publication of findings, quality of work and of publications must be the primary consideration. Findings, wherever factually possible, must be controlled and replicated before being submitted for publication.
  • Wherever achievement has to be evaluated — in reviewing grant proposals, in personnel management, in comparing applications for appointments — the evaluators and reviewers must be encouraged to make explicit judgements of quality before all else. They should therefore receive the smallest reasonable number of publications — selected by their authors as the best examples of their work according to the criteria by which they are to be evaluated.

Related information

Nature on new evaluation systems in The Netherlands and Ireland: Fewer numbers, better science

Episode 4 of Forecast with Max Planck Director Bjorn Stevens on clouds, aerosols, science and science management. Highly recommended.

Memorandum of the German Science Foundation: Safeguarding Good Scientific Practice. English part starts at page 61.

One of my first posts explaining why stick and carrot management makes productivity worse for cognitive tasks: Good ideas, motivation and economics

* Photo of Albert Einstein at the top is in the public domain.

3 comments:

izen said...

Some years ago I was asked (indirectly) to advise on an incentive bonus scheme that a small computer company wanted to set up for its workforce.
The initial idea from the finance department was to grade everyone on a number of criteria;- Reliability, initiative, punctuality etc on a scale from 1 to 10 and then combine the scores in some weighted system to give the amount of bonus a person would get.

I had two critiques of this:
1)- The combination of different measures, some numerical like punctuality and some subjective assessments like reliability was mixing apples and oranges, and then forcing it into a ten point scale that was arbitrary. My experience with other quality assessment systems has led me to the view that three levels is the most that is applicable to quality assessment. Basically you have; Good/Normal/Bad. Or in systems where negative assessment is not acceptable, Excellent/Good/Satisfactory. You can argue about the borderlines, but if your criteria are to be effective the number in the central group, the normals, will be the majority. The extremes, Good/Bad will be a smaller number than the average/normal group. Any attempt to make finer distinctions is invariably a futile exercise. It provides a false sense of increased accuracy without any real improvement in the information.
2)- The way the company worked was it sent teams out to install hardware. The membership of teams was fluid and was altered by the need for specific skills or numbers and friendship links. So a person might have a high score in one team, but low in another depending on the type of work and how well they got on with the team leader and assessor. Because the bonus was based on the individual score but was dependent on team dynamics it was intrinsically flawed, rewarding a person for something the team as a whole might be responsible for. That could be divisive.

The feedback I got was not positive. The idea that only three grades of assessment might be possible for quality was rejected. It seemed obvious to those considering the scoring system that by having a ten point scale they had increased the accuracy and usefulness of the information, not imposed an arbitrary and misleading metric.
The point about personal assessments being affected by the team context was taken as an indication that the personal score could be used to decide when to move someone from a team to improve their score.

It is extremely difficult to make any accurate measure of the quality of work done by people who are not working on a fixed, defined task. If creativity or scientific or clinical knowledge is involved it gets harder. That does not prevent those who wish to measure quality from finding some aspect of the activity that can be reduced to a numerical measure and then asserting that the number is a meaningful and accurate proxy for the quality they claim to be measuring. And that the finer the gradations of the number the more accurate the measurement.

Historically the one thing that does have a strong correlation with successful scientific research is throwing money at it. It is noticeable how often modern biochemicals, drugs and organic compounds have, as the date and location of their first isolation and synthesis, the big German investment in chemical research in the 1920s.

I do not know how well the incentive scheme worked at the computer company. It was of course a means of capping the wage bill. By having a bonus scheme the percentage of the company earnings that went to the workforce could be controlled by using the incentive scores to distribute a smaller slice of the company cake.

I concluded some time ago that the accountants had taken over the asylum.

protonsforbreakfast said...


Victor

Thank you for this article. I have two things to say: the first refers to the concept of 'Quality' and second to Goodhart's Law.

In Zen and the Art of Motorcycle Maintenance, the protagonist seeks a definition of 'Quality': struggling to find if it can be specified exactly or if it is just a matter of opinion. In the end he arrives at the idea that 'Quality' is a measure of the amount of attention that another human being has paid to something. There are some quantitative aspects to quality - but the key thing is that it is essentially a human to human communication. In scientific literature it requires a human being to read a paper to determine its quality - no matter how many citations it has.

The Second point is Goodhart's Law

https://en.wikipedia.org/wiki/Goodhart%27s_law

This states that once one uses a simple measurement as a proxy for the thing one means to measure, people will 'game' the system and the measurement will become worthless. This kind of 'quantitative thinking' is like a cultural infestation - thank you for speaking so clearly about its ill effects.

My comment on the effect is here:

https://protonsforbreakfast.wordpress.com/2015/10/05/target-culture-and-the-road-to-hell/

Michael

Victor Venema said...

Goodhart's Law is important to mention. The above problem is not just a scientific one, but a destructive micro-management technique that is quite common. I like the BBC Reith lecture on this a lot: A question of trust.