TY - JOUR
T1 - No one tool to rule them all
T2 - Prokaryotic gene prediction tool annotations are highly dependent on the organism of study
AU - Dimonaco, Nicholas J
AU - Aubrey, Wayne
AU - Kenobi, Kim
AU - Clare, Amanda
AU - Creevey, Christopher J
N1 - Funding
This work was supported by an Institute of Biological, Environmental and Rural Sciences Aberystwyth PhD fellowship (to N.J.D.). C.J.C. wishes to acknowledge funding from the Biotechnology and Biological Sciences Research Council (BB/E/W/10964A01, BBS/OS/GC/000011B); Department of Agriculture, Food and the Marine Ireland/DAERA Northern Ireland (Meth-Abate, R3192GFS); and the European Commission via Horizon 2020 (818368, MASTER).
Conflict of Interest: none declared.
PY - 2022/3/1
Y1 - 2022/3/1
N2 - Motivation: The biases in CoDing Sequence (CDS) prediction tools, which have been based on historic genomic annotations from model organisms, impact our understanding of novel genomes and metagenomes. This hinders the discovery of new genomic information as it results in predictions being biased towards existing knowledge. To date, users have lacked a systematic and replicable approach to identify the strengths and weaknesses of any CDS prediction tool and allow them to choose the right tool for their analysis. Results: We present an evaluation framework (ORForise) based on a comprehensive set of 12 primary and 60 secondary metrics that facilitate the assessment of the performance of CDS prediction tools. This makes it possible to identify which performs better for specific use-cases. We use this to assess 15 ab initio- and model-based tools representing those most widely used (historically and currently) to generate the knowledge in genomic databases. We find that the performance of any tool is dependent on the genome being analysed, and no individual tool ranked as the most accurate across all genomes or metrics analysed. Even the top-ranked tools produced conflicting gene collections, which could not be resolved by aggregation. The ORForise evaluation framework provides users with a replicable, data-led approach to make informed tool choices for novel genome annotations and for refining historical annotations.
AB - Motivation: The biases in CoDing Sequence (CDS) prediction tools, which have been based on historic genomic annotations from model organisms, impact our understanding of novel genomes and metagenomes. This hinders the discovery of new genomic information as it results in predictions being biased towards existing knowledge. To date, users have lacked a systematic and replicable approach to identify the strengths and weaknesses of any CDS prediction tool and allow them to choose the right tool for their analysis. Results: We present an evaluation framework (ORForise) based on a comprehensive set of 12 primary and 60 secondary metrics that facilitate the assessment of the performance of CDS prediction tools. This makes it possible to identify which performs better for specific use-cases. We use this to assess 15 ab initio- and model-based tools representing those most widely used (historically and currently) to generate the knowledge in genomic databases. We find that the performance of any tool is dependent on the genome being analysed, and no individual tool ranked as the most accurate across all genomes or metrics analysed. Even the top-ranked tools produced conflicting gene collections, which could not be resolved by aggregation. The ORForise evaluation framework provides users with a replicable, data-led approach to make informed tool choices for novel genome annotations and for refining historical annotations.
UR - http://www.scopus.com/inward/record.url?scp=85125463780&partnerID=8YFLogxK
U2 - 10.1093/bioinformatics/btab827
DO - 10.1093/bioinformatics/btab827
M3 - Article
C2 - 34875010
SN - 1367-4803
VL - 38
SP - 1198
EP - 1207
JO - Bioinformatics
JF - Bioinformatics
IS - 5
M1 - btab827
ER -