Multi-relational Association Mining Software for Genome Wide Association Studies

Project: Externally funded research

Project Details

Description

This proposal will produce association mining software for genome wide association studies. The software will find multi-relational associations, that is, it will be able to work with data expressed as relations spanning multiple database tables, or expressed as first order predicate logic. In this way we will be able to make use of not just simple marker variations and a basic phenotype, but complex structured phenotype data, information about parental genotype and phenotype, environmental data, information about sequence similarity, geography, longitudinal data and other data as required. The software will be based on high performance data structures (inverted indices and data compression) to provide an effective solution for large data that cannot easily be handled by existing algorithms. The software will be open source and documented. Aberystwyth University has a world-leading breeding program for the bioenergy crop, Miscanthus, with a collection of several thousand accessions. We will apply the software to the Miscanthus case study in Aberystwyth. We are currently obtaining genotype data for these collections, and these data will provide an excellent real-world application for the software.
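As a rough illustration of why inverted indices suit this kind of mining, the sketch below counts how many accessions share a set of marker values by intersecting posting sets. It is not the project's implementation: the accession names, the marker values, and the functions `build_inverted_index` and `support` are all hypothetical, and the data compression mentioned above is not shown.

```python
from collections import defaultdict

# Hypothetical toy data: accession -> set of (marker, allele) items.
records = {
    "acc1": {("ssr_1057", "long"), ("ssr_369", "short")},
    "acc2": {("ssr_1057", "long"), ("ssr_369", "long")},
    "acc3": {("ssr_1057", "long"), ("ssr_369", "short")},
    "acc4": {("ssr_1057", "short"), ("ssr_369", "short")},
}


def build_inverted_index(records):
    """Map each item to the set of accessions that contain it."""
    index = defaultdict(set)
    for accession, items in records.items():
        for item in items:
            index[item].add(accession)
    return index


def support(index, itemset):
    """Count the accessions that contain every item in itemset."""
    # Intersect the smallest posting sets first to keep intermediates small.
    postings = sorted((index.get(item, set()) for item in itemset), key=len)
    if not postings:
        return 0
    matching = set(postings[0])
    for posting in postings[1:]:
        matching &= posting
        if not matching:
            break
    return len(matching)


index = build_inverted_index(records)
pair = [("ssr_1057", "long"), ("ssr_369", "short")]
print(support(index, pair))
```

Because each candidate association is evaluated by set intersection rather than by rescanning every record, the cost depends on the size of the posting sets involved, which is one reason this layout is attractive for large marker data.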

Layman's description

We are now in the genomic age, and have a variety of technologies available to tell us about the specific genomes of the organisms we work with. In humans we would like to know about the genes associated with disease or ageing, so that we can more effectively target drugs or engineer vaccines. In plants we would like to know about the genes associated with resistance to drought, tolerance of stress, the production of seed and the density of growth, so that we can breed better crops that will be more suitable to future climates and demands for food and fuel production.

Full genome sequencing to discover this genomic information for large populations is still expensive, but the sequencing of the DNA at certain marker locations is now reasonably affordable and technologically possible. For humans we can use more than a million markers to determine their genetic makeup at these locations in their genome. The question then, for humans, crops or other organisms is how to relate this genotype to the disease/health/drought resistance/seed production they show (the phenotype). We need to find associations between genotype and phenotype.

Association mining is a data mining technique commonly used by academic and industrial data mining experts to find frequent associations. It is used by commercial retailers to suggest other products that a customer might also like to buy, given the history of frequently associated purchases made by others. This technology should be ideal for finding genotype/phenotype associations.

However, for this problem we have more complex data than the standard algorithms can process. The standard algorithms will only work for 'single table' data: data that could be represented in a single matrix of rows and columns. As soon as we want to specify more interesting relationships we need a more powerful representation for the data and for the association. For this we need to use first order predicate logic.
'First order' refers to the ability to use variables to represent relationships between the parts of the association (rather than just constants). The predicates describe those relationships. An example of such a relationship is:

    if genotype(Plant, ssr_1057, long, confident)
    and genotype(Plant, ssr_369, short, confident)
    and parent(Plant, Parent)
    and location(Parent, thailand)
    then dense_stems are 70% likely

We will produce software that can find multi-relational associations such as these in large amounts of complex data. We will apply this software to standard test data, and to a case study at Aberystwyth, for analysis of our population of the bio-energy crop Miscanthus. We will release the software as open source, with documentation and tutorials for the biological community to use.
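To make the example rule above concrete, here is a minimal sketch of how such a rule could be checked against facts held in several relations. Every plant name, parent name and fact is invented for illustration, `body_matches` is a hypothetical helper, and the resulting confidence is a property of this toy data only, not the 70% quoted in the example.

```python
# Toy facts spread over several relations; all names and values are invented.
genotype = {
    ("m1", "ssr_1057", "long", "confident"),
    ("m1", "ssr_369", "short", "confident"),
    ("m2", "ssr_1057", "long", "confident"),
    ("m2", "ssr_369", "short", "confident"),
    ("m3", "ssr_1057", "long", "confident"),
    ("m3", "ssr_369", "short", "confident"),
    ("m4", "ssr_1057", "long", "confident"),
    ("m4", "ssr_369", "long", "confident"),
    ("m5", "ssr_1057", "long", "confident"),
    ("m5", "ssr_369", "short", "confident"),
}
parent = {("m1", "p1"), ("m2", "p1"), ("m3", "p2"), ("m4", "p1"), ("m5", "p3")}
location = {("p1", "thailand"), ("p2", "japan"), ("p3", "thailand")}
dense_stems = {"m1", "m3", "m5"}


def body_matches(plant):
    """True if the rule body holds for this plant under some binding of Parent."""
    if (plant, "ssr_1057", "long", "confident") not in genotype:
        return False
    if (plant, "ssr_369", "short", "confident") not in genotype:
        return False
    # The variable Parent links the parent and location relations.
    return any(
        (par, "thailand") in location
        for (child, par) in parent
        if child == plant
    )


plants = sorted({fact[0] for fact in genotype})
covered = [p for p in plants if body_matches(p)]    # plants satisfying the body
hits = [p for p in covered if p in dense_stems]     # ...that also have dense stems
confidence = len(hits) / len(covered) if covered else 0.0

print(covered, hits, confidence)
```

The shared variables `Plant` and `Parent` are what a single-table representation cannot express directly: the check has to join the genotype, parent and location relations before the association can be counted.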
Status: Finished
Effective start/end date: 01 Jul 2012 to 30 Sept 2013

Funding

  • Biotechnology and Biological Sciences Research Council (Funder reference unknown): £105,178.56
