Abstract
This thesis documents an investigation into the acquisition of knowledge from biological data, using computational methods to discover significantly frequent patterns in gene location and phylogeny. Beginning with an initial statistical analysis of the distribution of gene locations in the flowering plant Arabidopsis thaliana, we discover unexplained elements of order. The second area of this research looks into frequent patterns in the one-dimensional linear structure of the physical locations of genes on the genome of Saccharomyces cerevisiae. This is an area of epigenetics which has, hitherto, attracted little attention. The frequent patterns are patterns of structure represented in Datalog, suitable for analysis using the logic programming language Prolog. This representation is used to find patterns in gene location with respect to various gene attributes, such as molecular function and the distance between genes. Here we find significant frequent patterns in neighbouring pairs of genes. We also discover very significant patterns in the molecular function of genes separated by distances of between 5,000 and 20,000 base pairs. However, in complete contrast to the latter result, we find that the distribution of genes by molecular function within a local region of ±20,000 base pairs is locationally independent.
In the second part of this research we look for significantly frequent patterns of phylogenetic subtrees in a broad database of phylogenetic trees. Here we investigate the use of two types of frequent phylogenetic structures. Firstly, phylogenetic pairs are used to determine relationships between organisms. Secondly, phylogenetic triple structures are used to represent subtrees. Frequent subtree mining is then used to establish phylogenetic relationships with high confidence between a small set of organisms. This exercise was invaluable in enabling these procedures to be extended in future to much larger sets of organisms.
This research has revealed effective methods for analysing the locations of genes within genomes, and has discovered patterns of order in those locations. Research into phylogenetic tree generation based on protein structure has identified the requirements for an effective method to extract elements of phylogenetic information from a phylogenetic database and reconstruct a single consensus tree from that information. In this way it should be possible to produce a species tree of life with a high degree of confidence and resolution.
Date of Award | 07 Apr 2009 |
---|---|
Original language | English |
Awarding Institution | |
Sponsors | Engineering and Physical Sciences Research Council |
Supervisor | Ross Donald King (Supervisor) & Amanda Clare (Supervisor) |
Keywords
- biological data
- statistical analysis