Richard G. Brereton*
School of Chemistry, University of Bristol, Cantocks Close, Bristol BS8 1TS, UK. E-mail: r.g.brereton@bris.ac.uk; Tel: +44 (0)1173295114
Some say chemometrics involves primarily statistics in analytical chemistry, in which case the discipline has existed since the early 20th century. Major advances, for example, were made in industrial process optimisation in the 1950s, following on from the war effort in many Western countries. Hence the disciplines of traditional univariate statistics, experimental design and optimisation, especially as tools for the chemical industry, were well established by the 1950s. However, what we now recognise as chemometrics is primarily a multivariate and computer-based discipline, even though the idea of PCA (Principal Components Analysis)1 was already available in the chemical literature in the nineteenth century. In the late 1960s and early 1970s, computers became more common, and several investigators were actively developing multivariate approaches for the interpretation of analytical chemical data. These include Peter Jurs2 and Ed Malinowski,3 who went on to publish seminal books and papers in the areas of pattern recognition and factor analysis respectively. In the early 1970s, Luc Massart entered the field from the perspective of chromatography.4
The common entry point of the pre-1960s pioneers was statistics, whereas those from the 1960s onwards were primarily programmers: either they or their close collaborators were adept at programming primitive mainframes, often in Fortran or related languages. Possibly the catalyst for modern chemometrics was multivariate computerised pattern recognition, and in this article we will accept this as the birth point, when chemometrics became evolutionarily distinct from its ancestors. There is, of course, no single agreed definition of what the discipline consists of.
Since the late 1960s, there has been an explosion in the use of multivariate techniques such as PCA5,6 and PLS7,8 as routine analytical techniques, especially in areas such as NIR (near infrared) spectroscopy. But despite this, chemometrics as a discipline probably reached its zenith around a decade ago. Since that time there has been a gradual decline in the number of chemometricians. Paul Geladi and Phil Hopke suggest, in an article published in both Chemometrics and Intelligent Laboratory Systems9 and the Journal of Chemometrics,10 that chemometricians may die out. Other subjects such as metabolomics are being spawned, whilst traditional univariate statistics in analytical chemistry has remained largely unchanged over many decades. There is a dangerous trend: the number of specialist chemometricians has been declining slowly but surely over the last few decades, whereas more and more people, especially in industry, demand the use of modern multivariate packages for their work, so there will be fewer and fewer experts serving more and more users. In environments where the level of expertise required is modest, the part-time chemometrician poses no harm. By analogy, organic chemists need to interpret their data using NMR, but an average synthetic chemist is not an expert NMR spectroscopist, will not have much knowledge of the theory of the technique or the electronics of the instrumentation, and would never be able to understand fundamental papers such as those of Richard Ernst.11 But there is still a healthy supply of NMR specialists, who do nothing but NMR and keep that discipline healthy. In chemometrics, in contrast, the specialist is in danger of dying out, with great dangers especially for industry.
Chemometrics grew, and then the bubble burst. Unlike disciplines such as bioinformatics, medical statistics, econometrics or even chemoinformatics, there are relatively few specialist chemometricians, who are needed to maintain the health of the subject into the future. The failure of chemometrics to survive as an expanding academic discipline is most evident in countries with a financially based academic system such as the USA or UK, where workers in universities are judged primarily by the amount of money they can raise. Pure chemometrics still exists, and in fact is growing, in countries where academic scientists can still pursue research for its own sake without worrying too much about funding and are primarily judged by productivity in terms of publications, such as Iran, certain Eastern European countries or China; but as their funding mechanisms evolve, pure chemometrics will probably die in these countries too. The torch is being passed from North America and Northern Europe, through Southern Europe and Eastern Europe, and then to Asia, as the academic model changes to one where scientists are largely judged by the level of funds they can attract. In contrast, this model is helping the growth of areas such as bioinformatics or econometrics.
In order to understand this, we need to study the changes in chemometrics since the 1970s.
In the 1980s, chemometrics experts got organised, with the first main conferences, journals, books and workshops. The word chemometrics, from a small seed, became more widespread. But this is where the danger started.
Most of the pioneers were rapidly able to establish large groups in the 1980s. A person's academic importance in a subject such as chemistry is primarily related to how many students and co-workers they can attract, which then catalyses papers and conference presentations and so international profile. In order to develop a large group, money is needed. In mainstream scientific disciplines such as synthetic chemistry this mainly comes via traditional routes: committees are well established and departments “boxed” in, so it is quite easy for a newcomer to climb the pole, as he or she is pigeon-holed at an early stage and knows where to attract funds and collaborators. For a subject like chemometrics, there were few such opportunities, so money came primarily from industry. Some of the early achievements, notably in NIR, were in areas where there was a definable industrial need, first in the petrochemical and food industries and then in the pharmaceutical industry. People needed chemometrics approaches, and so it was possible for a small number of pioneering individuals to obtain funding and good collaborations. These people were often left alone by their employers, so long as money flowed in. But departments and academic disciplines are not funded on short-term industrial grants, just individuals. There were many other people working in this area, many very good, but perhaps not so good at attracting funding: these latter often had to focus their work in other areas, and just do chemometrics part-time as a hobby. So although there were not many focussed chemometrics experts around at the time, the few leading individuals could call themselves chemometricians, whereas the others were primarily analytical or physical or organic or environmental or pharmaceutical chemists or whatever, trying to squeeze a little bit of chemometrics into their work.
As time went on, from the 1980s onwards, computers became more user friendly, and so chemometrics methods, once the preserve of the technical expert who could program a mainframe and understand complex algorithms, gradually became incorporated into packaged software. Many of the most widespread packages have their origins in that period, such as SIMCA, PLS Toolbox, Unscrambler and Pirouette. As more people heard the word chemometrics, more wanted to use the more widely available methods. But these new people did not have the time to learn about the computational and statistical basis of chemometrics methods; they were not programmers or experts in maths or computer science. They wanted to use the methods “immediately”. If a busy worker in industry decided he or she wanted to do some pattern recognition, they would not have the time to take 2 or 3 years out to learn the scientific basis of their approaches. A salesman will come up to them and tell them that if they spend $1000 on a package and attend a one-day course on PCA (or whatever) they will be sufficiently expert to master the discipline; this is what they want to hear, and they will take the easiest option. Given the choice of paying for a package and one or two days' training, or giving their employees 3 or 4 years' leave to pursue a PhD in chemometrics, managers in industry will infallibly opt for the first solution.
So as computers became easier to use, and as more people wanted to use multivariate methods, more companies offered an “easy learning” route to chemometrics expertise, via the sale of a package and a day's training. If a company wanted better expertise, it might hire a specialist, but only large companies would have sufficient business and funds to hire a dedicated chemometrics expert. In the early 1990s, there was less demand for such expertise, so the market was satisfied with the existing pool. However, industry is often very reluctant to fund long-term infrastructure or departments. Usually companies will expect someone else to train their personnel in dedicated PhD or masters courses (of which there are a few, but certainly not many, world-wide) and pay for it, rather than pay a university a long-term and substantive grant for establishing a chemometrics department. They may pay for a single focussed project, or into a consortium, but that will not solve the long-term infrastructure problem. Hence chemometrics in academia was very fragile, based around a few individuals, whose key skill was the ability to obtain funds.
The evolution of chemometrics in academia was quite different, for example, to that of bioinformatics. Whereas industry certainly has a need for bioinformatics, the field can also attract significant public funding, and hence became the basis of solid academic departments and groups. This is partly because medicine and genomics are regarded as being for the public good, so a pure research base can be established that is not primarily dependent on short-term focussed industrial funding.
There are, and continue to be, major barriers for chemometrics to obtain academic respectability and significant public funding. One problem is the rather convoluted relationship between chemometricians and academic statisticians. University statistics departments tend to be applied maths departments, but academic statisticians will rigorously defend their patch. Many would like to regard chemometrics as their own area, but in practice extremely few statisticians actually contribute to chemometrics journals or conferences. There should, however, be a difference in orientation between statisticians and chemometricians. The latter often have to deal with messy data, not necessarily of the quality, size and design expected by a statistician, and have to develop safe approaches to handle these datasets. Chemometricians primarily focus on multivariate methods, whereas most statisticians are trained in experimental design, ANOVA and regression: many of the well-designed datasets a professional statistician is trained to handle simply do not and cannot exist in chemometrics. Finally, a chemometrics expert will also understand the problems of instrumental data, for example how to resolve GC-MS data, correct artefacts in NIR or pre-process NMR, and will balance this need against the later steps. Many statisticians in fact never encounter these sorts of problems (in a survey, for example, there is no preprocessing or instrumental problem, so the data can be analysed directly using statistical methods), and therefore will rarely take the problems of analytical instrumentation into account in their data analysis strategy, often assuming that “the chemist has done this already”.
Of course no one can deny the importance of sound statistical thinking, but this is really just one of several inputs into the chemometrician's tool-box, and the two groups of people are related but not the same (although it is equally a mistake to ignore the sound statistical basis of multivariate methods). Many statisticians do not wish to acknowledge this, and particularly in academia, chemometrics is regarded as second class statistics rather than first class chemometrics. Hence the discipline has had significant problems establishing itself as a reputable stand-alone academic area for research for mathematically based scientists.
Another problem is that many chemists like to do the data analysis within their own groups. In some ways this is a consequence of poor planning: when a project generates an explosion of data, chemists tend to underestimate the data analysis needs and only realise towards the end of the project that they have put insufficient resources into this aspect, so there are not enough funds available to hire dedicated experts. A second problem is that many chemists feel themselves masters of all things, and believe that they or their students can pick up methods part-time to assist their projects. If all that is required is a factorial design or a PC plot, this may be so, and for the sort of relatively small problems studied a couple of decades ago, there often was not much data. But many modern problems result in huge amounts of instrumental data and require significant data analysis expertise. Consider genomics, where a good database can be mined for many years or decades afterwards; chemists likewise have the capacity to generate enormous quantities of data. But the pressure from funding bodies and PhD-awarding bodies is to “write up” the results in what is often quite a short time period, and then the investigator, in order to get more funds, has to move on, meaning that huge repositories of data are often left largely untouched. If a chemist applies for funds just to “analyse their data” after the project, the average funding body will believe that the project was unfinished and usually will not grant this. So within chemistry there has developed the cult of the amateur who produces a PCA plot on a Friday afternoon a few weeks or days before the project finishes and then moves on. Dedicated chemometrics expertise is rarely written into chemistry projects, often because it is viewed as expensive, as something to be “done in a few days at the end of the work”, and as a sideline “anyone can do”.
In areas such as synthesis, it is quite normal for investigators to have a wide range of skills, such as MS, NMR, chromatography and so on, though not normally at a very high level; what is also needed is the specialist researcher who can pioneer and develop new techniques that are then filtered down to the chemist. Some academic chemists, if they do encounter a very tricky statistical problem, will go down to their statistics department, because many universities tend to think in boxes, rather than go to in-house chemometrics experts. This cuts out the chemometrician, and therefore provides less food for chemometrics to grow. However statisticians, whilst certainly providing an essential fundamental basis for many of the multivariate methods used by chemometricians, will not always have a strong grasp of the underlying chemistry: they may not recognise that there were problems tuning the GC-MS, or with the reproducibility of the mass spectra, or solvent shifts in the NMR spectrum; they will often want to use their own methods rather than spend time on the messy cleaning up of analytical data, and may fail to look at problems holistically.
Hence there has been very limited success in developing chemometrics as a firm independent scientific discipline. Industry is reluctant to fund long term stable departments. Academic statisticians view chemometricians as second class. Chemists want to do the work themselves.
Chemometrics does still exist, though, partly due to a small number of front-line figures. In countries where funding is less important for the scientific establishment, chemometrics for its own sake is expanding, but if their funding model changes, it may well retreat. And many who would previously have viewed themselves as chemometricians have found more financial success in related areas such as metabolomics or chemical engineering or medicine, building on their experience in chemometrics, which is viewed more as a phase or stepping stone into more lucrative areas. Many very prominent leaders have jumped ship with great success into other disciplines when they realised the poor potential of core chemometrics.
Yet in parallel to this disappointing evolution of the core subject, there has been widespread growth in the use of packages, and this has created a very dangerous situation. Very few users of these packages understand very basic principles. It is very easy to over-fit and over-interpret models. It is easy, for example, to obtain a separation using PLS methods even from random data,12 yet many scientists like these over-fitted models as they suggest that their experiments are better and more successful than they really are, and referees of papers or managers of projects or funding bodies prefer unrealistic and over-optimistic summaries. The majority will want to publish a PLS-DA scores plot that shows an apparently good separation between two groups rather than a PCA scores plot that does not. Very few users of packages understand the difference between optimisation and validation and often, for example, use the results of cross-validation simultaneously to find the number of components in a model and to validate it. Go to a chemometrics conference and ask individuals if they understand the relationship between the normal distribution, F distribution, t distribution and χ2 distribution: all are based on the underlying assumption that errors are normally distributed, but probably very few even at such a specialist conference could answer, despite the fact that many will be using such distributions for assessing models on a regular basis. There are numerous papers using PLS-DA, but how many users understand the assumptions behind the method, and how many realise that the performance of such a method really depends on a series of decisions (how to scale the data, what cut-off threshold to use for a decision function, when to use PLS1-DA rather than PLS2-DA) rather than on the algorithm itself?
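The random-data pitfall is easy to reproduce. The sketch below is a minimal illustration (not taken from the cited study): it fits a one-component PLS model, written directly in NumPy, to pure noise with arbitrary class labels. Because there are far more variables than samples, the fitted scores separate the "classes" almost perfectly, even though no real structure exists.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 20, 200                      # few samples, many variables, as in spectroscopy
X = rng.standard_normal((n, p))     # pure noise "spectra"
y = np.repeat([0.0, 1.0], n // 2)   # two arbitrary "classes"

# Centre the data, as PLS algorithms normally do
Xc = X - X.mean(axis=0)
yc = y - y.mean()

# One-component PLS: weight vector maximising covariance between scores and y
w = Xc.T @ yc
w /= np.linalg.norm(w)
t = Xc @ w                          # scores
q = (t @ yc) / (t @ t)              # regression coefficient on the score
y_hat = t * q                       # fitted (centred) responses

# Apparent training accuracy: near-perfect, despite the data being random
train_acc = np.mean((y_hat > 0) == (yc > 0))
print(f"training accuracy on random data: {train_acc:.2f}")
```

Predictions on samples that played no part in fitting would, of course, hover around chance, which is also why using cross-validation both to choose the number of components and to report performance is itself a form of over-fitting.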
Many people, even quite regular authors of chemometrics papers, claim that method A is better than method B, but what they are really saying is that by using a specific approach for validation and perhaps 10 steps in the data analysis, one of which contains an algorithm such as PLS, they can get a higher accuracy or lower error rate for the series of steps containing method A than for that containing method B; and usually they fail to describe this series of steps in sufficient detail for their work to be of any general validity at all. What proportion of users of chemometrics methods understand that most statistics from methods such as SIMCA13 are based on the assumption that the underlying distribution of the data is normal? Have they tested their data for normality, or transformed it to fall approximately into a normal distribution? Do they understand that cut-off thresholds such as 99% depend very critically on having a good model of the underlying distribution? In 2006 the journal Trends in Analytical Chemistry published a special issue on common misconceptions and errors in chemometrics.14
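As a hypothetical illustration of the normality point, not tied to any particular SIMCA implementation, the sketch below uses SciPy's Shapiro–Wilk test to check a skewed (log-normal) variable before and after a log transform; a 95% or 99% cut-off derived from a normal model is only meaningful once such a check passes.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Skewed "peak area" measurements: log-normal, a common shape for analytical data
raw = rng.lognormal(mean=0.0, sigma=1.0, size=200)

# Shapiro-Wilk test: a small p-value rejects the hypothesis of normality
stat_raw, p_raw = stats.shapiro(raw)
stat_log, p_log = stats.shapiro(np.log(raw))

print(f"raw data:        p = {p_raw:.2e}  (normality clearly rejected)")
print(f"log-transformed: p = {p_log:.2e}")
```

Where no transform makes the data approximately normal, distribution-free alternatives such as percentile-based cut-offs are safer than nominal 95% or 99% limits computed from a normal model.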
How to solve this serious problem is a very difficult question. With a rapid expansion in chemometrics packages but a slow decline in dedicated chemometrics expertise, the gap is widening dramatically. Many papers published using chemometrics software are misleading and in many cases wrong. The misuse of statistics in biology has always been a serious problem, and it is still acknowledged that a large proportion of biology papers contain serious problems with the statistical interpretation of their data,15 where David Vaux states “… it is still common to find papers in most biology journals … contain … basic statistical errors. In my opinion, the fact that these scientifically sloppy papers continue to be published means that the authors, reviewers and editors cannot comprehend the statistics, that they have not read the paper carefully, or both. Why does this happen? Most cell and molecular biologists are taught some statistics during their high-school or undergraduate years, but the principles seem to be forgotten somewhere between graduation and starting in the lab. Often, the type of statistics they learnt is not relevant to the kinds of experiment they are now doing. And, once in the lab, people generally just do what everyone else does, without always understanding why”. The misuse of chemometrics is likewise causing difficulties. Perhaps one hundred journals, or even more, now publish papers with some so-called chemometrics content. There are not enough people with enough time to referee these papers for detailed chemometric sense, so there are fewer expert reviews of this aspect of the work. This means that a significant portion of papers using multivariate methods in chemistry is dubious, but there is little that can be done. Even if one or two journals significantly raise their standards (and if they do, who will police it, as the pool of chemometrics experts is not particularly large), there will always be others, some quite respectable with high impact.
Indeed, many applied journals tend to have higher impact than theoretical journals, so there is an impetus to publish papers with sloppy chemometrics in high-impact journals, while rigorous ones appear in lower-impact journals.
So the future of chemometrics as a discipline looks bleak. The continual growth of packaged software, and greater awareness of, for example, the need for basic experimental design, will of course be assured, but in the absence of a co-ordinated academic base, chemometrics will continue to be led by a small pool of leading individuals, too few to actively service the needs of the international community, many moving into other, lusher pastures. There are, of course, still tremendous intellectual challenges. The ability to acquire huge quantities of data using modern instrumentation and modern computing power means that methods from machine learning such as support vector machines16 and self-organising maps17 have enormous potential. The combination of traditional statistical methods with multivariate methods, for example ASCA18 and multilevel methods,19 shows promise. There are huge possibilities for data mining. And the development of new disciplines such as metabolomics allows a pathway for chemometrics experts to receive significant funding for development in their own right.
But the core discipline of chemometrics for its own sake is likely to stay small or contract in the near future and, unlike medical statistics or econometrics or bioinformatics, is unlikely to result in a significant number of departments or be recognised as a core academic discipline; it will continue to be dominated by a small number of leading individuals, many of whom are good at obtaining funds. It is not clear whether the tremendous potential of chemometrics in this new age of large analytical datasets and cost-effective computer power will be realised, although there will always be good theoretical workers who will apply their skills in statistics or metabolomics or medicine or machine learning. It is quite common for promising fields to be kicked into the long grass, some remaining dormant for decades or more, some never to emerge again, and others to be rediscovered by new generations far into the future. Who knows what will happen a century from now? Chemometrics has existed in some form for around half a century, but whether it will die out now is a question only future generations will be able to answer. It is well known that mammals survived the age of dinosaurs, to emerge dominant after a mass extinction. If in the future an important avenue of highly funded research is shown to have produced dangerous conclusions that could cause major problems for health and safety, based on less than rigorous data analysis, it may be that the bubble of research for the sake of huge dollar signs will become less reputable, and smaller, semi-dormant but very rigorous disciplines like chemometrics will emerge from the woodwork in decades to come, with the survivors inheriting the earth. But for this to happen would require a catastrophic event to the present system of funding research and funding scientists.
This journal is © The Royal Society of Chemistry 2013