ORFs, StORFs and Pseudogenes: Uncovering Novel Genomic Knowledge in Prokaryotic and Viral Genomes

  • Nick Dimonaco

Student thesis: Doctoral Thesis, Doctor of Philosophy


This thesis re-evaluates the multifaceted processes behind prokaryotic genome annotation, which are often viewed as simplistic when compared to eukaryotic genomics. To undertake this re-evaluation, both contemporary and novel methods of characterising prokaryotic genomic data were investigated and developed to further our understanding of these organisms.

In Chapter 2, historic and contemporary prokaryotic genome annotation techniques are evaluated via the development of a novel genome annotation comparison and improvement platform, ORForise (https://github.com/NickJD/ORForise). The results from ORForise showed that these techniques are effective at identifying genes that are similar to those in existing genomic databases. However, two key findings in Chapter 2 point to notable inadequacies. Firstly, no single annotation tool performed best for all genomes studied; the type of gene and the organism being annotated were the most important criteria when choosing a genome annotation tool. Secondly, even after taking into account many of the limitations shared by the annotation tools considered in this study, an unexpectedly high number of large regions in each genome were consistently labelled as ‘intergenic’ or left without annotation.

In Chapter 3, many of the specific weaknesses identified in the annotation tools from Chapter 2 were investigated in detail. This resulted in the identification of a set of full-length CDS gene sequences in these ‘intergenic’ regions which formed part of known and novel core and soft-core gene families in the E. coli pangenome. Additionally, a large number of highly conserved gene families were found in ‘intergenic’ regions across multiple genera. This adds evidence to the contention that regions of DNA labelled as ‘intergenic’ by existing annotation tools contain real genes and, as such, these regions were renamed ‘unannotated regions’. This discovery and the redefinition of ‘intergenic’ regions were made possible by the development and modification of two novel techniques and software platforms, ORForise and StORF-Reporter (https://github.com/NickJD/StORF-Reporter). These allowed for the extraction of additional and novel genomic information from existing genomic databases. Additionally, as StORF-Reporter found a number of putative CDS gene fragments in these unannotated regions, Chapter 4 focuses on the absence of pseudogenes in canonical genome annotations. This uncovered thousands of potential pseudogenised and functional genes that were missed by annotation tools due to either terminating in-frame stop codons or the alternative use of stop codons to code for amino acids. The results from Chapters 3 and 4 have led to the redefinition of the gene collection of the E. coli pangenome and of many of the studied genera, and may also affect our understanding of their phylogeny.

To enable the discoveries and analysis in Chapters 2, 3 and 4, a number of passive computational approaches (those which can only operate alongside a rigid set of predefined rules) were used and developed. The majority of rules or parameters in these approaches were tuned through a thorough investigation of genomic features identified manually. However, biology as a domain has too many exceptions and too many rules for a passive computational approach to be universally tractable. The scale of this problem is matched only by the volume of genomic data that we now have available. To overcome this, machine learning methodologies were investigated during a research visit to King Abdullah University of Science and Technology (KAUST) in Saudi Arabia. Specifically, the growing affinity between machine learning and biology was investigated in Chapter 5 with a novel neural network algorithm named FrameRate (https://github.com/NickJD/FrameRate). FrameRate was developed to offer insight into the coding potential of unassembled DNA sequences without the need for sequence homology or assembly.

Lastly, at the beginning of the current SARS-CoV-2 pandemic, an opportunity was presented to apply the skills and knowledge gained throughout the development of this thesis to the novel SARS-CoV-2 genome. Chapter 6 describes how a novel hybrid genome annotation approach, which combined ab initio gene prediction and sequence alignment techniques, was developed and used to annotate coronavirus genomes found in human, bat and pangolin hosts. Additionally, unlike other contemporary gene prediction tools, StORF-Reporter was able to identify the enigmatic ORF10 open reading frame in SARS-CoV-2 without sequence alignment or RNA-Seq analysis.
Date of Award: 2022
Original language: English
Awarding Institution
  • Aberystwyth University
Supervisors: Chris Creevey, Amanda Clare, Wayne Aubrey, Kim Kenobi, Robert Hoehndorf & Arwyn Edwards
