TY - JOUR
T1 - Enhancement of plant metabolite fingerprinting by machine learning
AU - Scott, Ian Morris
AU - Vermeer, Cornelia P.
AU - Liakata, Maria
AU - Corol, Delia I.
AU - Ward, Jane L.
AU - Lin, Wanchang
AU - Johnson, Helen Elisabeth
AU - Whitehead, Lynne
AU - Kular, Baldeep
AU - Baker, John M.
AU - Walsh, Sean
AU - Dave, Anuja
AU - Larson, Tony R.
AU - Graham, Ian A.
AU - Wang, Trevor L.
AU - King, Ross Donald
AU - Draper, John
AU - Beale, Michael H.
N1 - IMPF: 06.45
PY - 2010/8
Y1 - 2010/8
N2 - Metabolite fingerprinting of Arabidopsis (Arabidopsis thaliana) mutants with known or predicted metabolic lesions was performed by 1H-nuclear magnetic resonance, Fourier transform infrared, and flow injection electrospray-mass spectrometry. Fingerprinting enabled processing of five times more plants than conventional chromatographic profiling and was competitive for discriminating mutants, other than those affected in only low-abundance metabolites. Despite their rapidity and complexity, fingerprints yielded metabolomic insights (e.g. that effects of single lesions were usually not confined to individual pathways). Among fingerprint techniques, 1H-nuclear magnetic resonance discriminated the most mutant phenotypes from the wild type and Fourier transform infrared discriminated the fewest. To maximize information from fingerprints, data analysis was crucial. One-third of distinctive phenotypes might have been overlooked had data models been confined to principal component analysis score plots. Among several methods tested, machine learning (ML) algorithms, namely support vector machine or random forest (RF) classifiers, were unsurpassed for phenotype discrimination. Support vector machines were often the best performing classifiers, but RFs yielded some particularly informative measures. First, RFs estimated margins between mutant phenotypes, whose relations could then be visualized by Sammon mapping or hierarchical clustering. Second, RFs provided importance scores for the features within fingerprints that discriminated mutants. These scores correlated with analysis of variance F values (as did Kruskal-Wallis tests, true- and false-positive measures, mutual information, and the Relief feature selection algorithm). ML classifiers, as models trained on one data set to predict another, were ideal for focused metabolomic queries, such as the distinctiveness and consistency of mutant phenotypes. Accessible software for use of ML in plant physiology is highlighted.
AB - Metabolite fingerprinting of Arabidopsis (Arabidopsis thaliana) mutants with known or predicted metabolic lesions was performed by 1H-nuclear magnetic resonance, Fourier transform infrared, and flow injection electrospray-mass spectrometry. Fingerprinting enabled processing of five times more plants than conventional chromatographic profiling and was competitive for discriminating mutants, other than those affected in only low-abundance metabolites. Despite their rapidity and complexity, fingerprints yielded metabolomic insights (e.g. that effects of single lesions were usually not confined to individual pathways). Among fingerprint techniques, 1H-nuclear magnetic resonance discriminated the most mutant phenotypes from the wild type and Fourier transform infrared discriminated the fewest. To maximize information from fingerprints, data analysis was crucial. One-third of distinctive phenotypes might have been overlooked had data models been confined to principal component analysis score plots. Among several methods tested, machine learning (ML) algorithms, namely support vector machine or random forest (RF) classifiers, were unsurpassed for phenotype discrimination. Support vector machines were often the best performing classifiers, but RFs yielded some particularly informative measures. First, RFs estimated margins between mutant phenotypes, whose relations could then be visualized by Sammon mapping or hierarchical clustering. Second, RFs provided importance scores for the features within fingerprints that discriminated mutants. These scores correlated with analysis of variance F values (as did Kruskal-Wallis tests, true- and false-positive measures, mutual information, and the Relief feature selection algorithm). ML classifiers, as models trained on one data set to predict another, were ideal for focused metabolomic queries, such as the distinctiveness and consistency of mutant phenotypes. Accessible software for use of ML in plant physiology is highlighted.
UR - http://www.scopus.com/inward/record.url?scp=77955681638&partnerID=8YFLogxK
U2 - 10.1104/pp.109.150524
DO - 10.1104/pp.109.150524
M3 - Article
C2 - 20566707
SN - 0032-0889
VL - 153
SP - 1506
EP - 1520
JO - Plant Physiology
JF - Plant Physiology
IS - 4
ER -