Today I downloaded and installed the R program for analysing Google Scholar citation metrics (you can pick it up here).
There is a lot of talk about the various metrics used to analyse the productivity of scientists, and there seems to be no really good way to do it: a single point-statistic doesn't capture a career very well.
The questions we face are:
- 1. How do we rate a very highly-cited paper with 200 authors? If it is cited 2,000 times, is it realistic for each of those 200 authors to claim that they individually were cited 2,000 times? Do we divide the number of citations by the number of authors?
- 2. A scientist's age will clearly influence how many times he/she has been cited. It is unreasonable to expect somebody with a publication record of 2 years to have the same number of citations as a scientist who has been publishing for 20 years. Do we simply divide the number of citations by the number of years?
- 3. If you are one of the most important people in driving a project - you were awarded the funding, you directed the research or you actually designed and carried out the experiments - then you probably occupy a more prominent position in the author list of multi-authored papers. So how do we treat this? Does somebody who is buried in the middle of a large multi-authored paper get the same credit as the first author or the corresponding author?
- 4. How do we view the career of a scientist who has published 50 papers that were not in high-impact journals but were steadily cited over the years, versus a scientist who published only one paper every 5 years, but whose papers always turned out to be highly cited and appeared in very high-impact journals? Is this an issue of productivity versus quality? Numbers of papers seem to me to indicate an ability to get work done, but science is influenced by insightful, key publications.
Therefore, it seems most sensible to summarise citation data in as many ways as possible. Google Scholar is making this easier, and soon we should easily be able to understand the career structure of scientists.
The most frequently used indices are the H-index (an author has an h-index of h when they have h papers that have each been cited at least h times), the G-index (the largest number g such that the top g articles have together received at least g² citations) and the M-index (the h-index divided by the number of years that the scientist has been active).
Take the following scenario. A scientist has published 20 papers and only one of them has ever been cited. This would give that scientist an h-index of 1. If that scientist had published the 20 papers over a period of 10 years, then their M-index would be 0.1. Neither of these indices would be too flattering for this scientist. However, if that single cited paper had received 400 citations, then the scientist would have a G-index of 20 (the maximum possible for 20 papers, since those papers together hold 400 = 20² citations).
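All three indices are simple to compute from a list of per-paper citation counts. Here is a minimal sketch (in Python, not the R script mentioned above; the function names are my own), using the scenario from this paragraph as input:

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    cites = sorted(citations, reverse=True)
    # Once sorted in descending order, count ranks where the paper at
    # that rank still has at least that many citations.
    return sum(1 for rank, c in enumerate(cites, start=1) if c >= rank)

def g_index(citations):
    """Largest g such that the top g papers together have at least g^2 citations."""
    cites = sorted(citations, reverse=True)
    total, g = 0, 0
    for rank, c in enumerate(cites, start=1):
        total += c
        if total >= rank * rank:
            g = rank
    return g

def m_index(citations, years_active):
    """H-index divided by the number of years the scientist has been active."""
    return h_index(citations) / years_active

# The scenario from the text: 20 papers, one cited 400 times, a 10-year career.
papers = [400] + [0] * 19
print(h_index(papers))      # 1
print(g_index(papers))      # 20
print(m_index(papers, 10))  # 0.1
```

Note that the G-index here is capped at the number of published papers, which is why 20 is the maximum for this scientist.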
Therefore, I guess the point is that it is important to look at the data from a variety of angles before coming to the conclusion that somebody has a great publication output or not. A high h-index for somebody who has never been first or corresponding author on any paper is not unusual.
Possibly a more realistic example is of a scientist who is very well-known, but whose name I won't mention.
These are his stats:
Total papers = 268
Median citations per paper = 3
Median (citations / # of authors) per paper = 0.4285714
Mean citations per paper = 107.5746
H-index = 63
G-index = 169
M-index = 3.315789
First author H-index = 17
Last author H-index = 13
First or last author H-index = 22
First or second author H-index = 28
Probably the most curious statistic is the one where his median citations per paper is divided by the number of authors. At 0.43 citations, this is an exceptionally low number for a senior scientist, but it is somewhat typical for a scientist working in one of the genomics factories. He is an author on a large number of papers from genome projects. These papers report genome sequences and have large numbers of authors, so if you divide the citations evenly among the authors, each one does not receive many citations.
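The per-author correction amounts to dividing each paper's citations by its author count before taking the median. A sketch, using made-up (citations, author-count) pairs chosen to mimic a genomics-factory profile (these numbers are hypothetical, not this scientist's real record):

```python
from statistics import median

# Hypothetical (citations, author_count) pairs: a couple of massively-cited
# consortium papers with hundreds of authors, plus several lightly-cited papers.
papers = [(2000, 200), (1500, 150), (3, 10), (2, 8), (1, 12), (0, 5)]

raw_median = median(c for c, _ in papers)             # median citations per paper
per_author_median = median(c / n for c, n in papers)  # citations split among authors

print(raw_median)                    # 2.5
print(round(per_author_median, 3))   # 0.275
```

Despite thousands of total citations in this toy dataset, the per-author median stays tiny, which is exactly the pattern the statistic is designed to expose.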
The other thing of interest is the difference between the overall H-index and the H-index computed only over papers where this scientist was first or last author, i.e. one of the major drivers of the manuscript. While most other statistics suggest that this scientist has been much more productive (almost 300 papers), more highly cited (almost 30,000 citations) and more influential (>100 citations per paper on average), a careful analysis will tell you that he has been working in a large research factory where large numbers of scientists get included on large numbers of papers.
For comparison, I analysed another very successful scientist: somebody who has always been a fairly independent scientist, very productive, but who has generally worked in collaboration with other scientists rather than in any of the genomics factories.
Total papers = 123
Median citations per paper = 21
Median (citations / # of authors) per paper = 9
Mean citations per paper = 107.5854
H-index = 38
G-index = 115
M-index = 1.583333
First author H-index = 33
Last author H-index = 32
First or last author H-index = 37
First or second author H-index = 40
If the number of papers or the H-index were taken as the only points of comparison, we would get a very different picture than if we compared a variety of metrics. Here, the median citations per paper per author is about 21 times higher than for the previous scientist. Also of interest is the H-index restricted to papers where this author was a major driver of the science (the first-or-last-author H-index): for this scientist, the value is much higher. This is the imprint of a very clever and productive scientist who has consistently produced very insightful papers. The previous set of statistics is the imprint of a scientist who was an important member of a large organisation, contributed to its work and was a co-author on many of its papers; probably also a very clever and important scientist. Of interest is the fact that the first scientist is first or last author on only 6 of his top 40 most highly-cited papers.
As always, when metrics are used to assess quality, people will look for ways to game the system. The obvious easy one is to make a deal with a buddy for reciprocal authorship on manuscripts. I have seen this happening, and it's not fair in my opinion. Authorship should only be a consequence of a real and meaningful contribution to the research. Journals demand this now, but ultimately it is left to the honesty of the scientists to declare that every author has made a contribution. In reality, however, the system is still being gamed, and the ultimate value of any of these indices will always rely on honest authorship, which does not always happen.
For the record, here are my stats today, according to Google Scholar.
Total papers = 73
Median citations per paper = 11
Median (citations / # of authors) per paper = 4
Mean citations per paper = 27.06849
H-index = 24
G-index = 43
M-index = 1.5
First author H-index = 9
Last author H-index = 15
First or last author H-index = 18
First or second author H-index = 18
i10-index = 41
Recent i10-index = 33
Therefore, I have a much better M-index than Albert Einstein (true, he died a long time ago), and my median citations per paper and median citations per paper per author are also better than his. I think this makes my point. Einstein published 707 papers.
Also, for fun, below you will find a wordcloud of my co-authors on the left and the most frequently used words in the publication titles on the right.
Please feel free to leave a comment or to use the social media buttons below to share this post.