Created by W.Langdon from gp-bibliography.bib Revision:1.8168
This thesis investigates different aspects of alternative splicing in humans, based upon computational large-scale analyses. We introduce a genetic programming approach to predict alternative splicing events without using expressed sequence tags (ESTs). In contrast to existing methods, our approach relies on sequence information only, and is therefore independent of the existence of orthologous sequences.
We analysed 27,519 constitutively spliced exons and 9,641 cassette exons (SCE) together with their neighbouring introns; in addition, we analysed 33,316 constitutively spliced introns and 2,712 retained introns (SIR). We find that our classifier yields highly accurate predictions on the SIR data, with a sensitivity of 92.1% and a specificity of 79.2%. Prediction accuracies on the SCE data are lower: 47.3% (sensitivity) and 70.9% (specificity), indicating that alternative splicing of introns can be better captured by sequence properties than that of exons.
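For readers unfamiliar with the two measures quoted above, the following is a minimal sketch of how sensitivity and specificity are computed from a confusion matrix. It assumes that the alternatively spliced class (cassette exon or retained intron) is treated as the positive class, which the abstract does not state explicitly. The function names and the toy counts are illustrative only and are not taken from the thesis.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of actual positives that were predicted positive."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of actual negatives that were predicted negative."""
    return true_neg / (true_neg + false_pos)


# Toy counts (hypothetical, NOT results from the thesis).
toy_sens = sensitivity(true_pos=9, false_neg=1)
toy_spec = specificity(true_neg=8, false_pos=2)
print(f"sensitivity={toy_sens:.1%}, specificity={toy_spec:.1%}")
```

Because the constitutive class is several times larger than the alternative class in both datasets, reporting the two measures separately is more informative than a single overall accuracy.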
We critically question these findings and, in particular, discuss the strong influence of the feature 'length' on predictions for retained introns. We find that the number of adenosines in an exon, called 'feature A', is a highly prominent feature for the classification of exons. Adenosines are especially overrepresented in the most abundant exonic splicing enhancers, which are found in constitutive exons. Furthermore, we comment on inconsistencies in the nomenclature and on problems in handling the splicing data, and we make suggestions to improve the terminology.
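As an illustration of the two features named above, the sketch below computes the adenosine count of a sequence together with its length. The abstract defines 'feature A' only as the number of adenosines in an exon; the function names, the case-insensitive matching, and the example sequence are assumptions made for this sketch.

```python
def feature_a(exon_seq: str) -> int:
    """Count adenosines (A) in an exon sequence, ignoring case."""
    return exon_seq.upper().count("A")


def feature_length(seq: str) -> int:
    """Length of the sequence in nucleotides."""
    return len(seq)


# Hypothetical toy sequence, not taken from the thesis data.
toy_exon = "ATGAAC"
print(feature_a(toy_exon), feature_length(toy_exon))
```

Note that a raw count such as this grows with sequence length, so the two features are not independent of each other.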
For further in silico exploration of the sequence properties of exons, we generated a dataset of synthetic exons. We describe a general rule for creating sequences whose exonic splicing enhancer and silencer densities, as well as their exonic splicing enhancer networks, are similar to those of real exons. We find that exonic splicing enhancer densities are well suited for differentiating real from randomised exons, whereas the densities of SR protein binding sites are largely uninformative. Generally, we find that features described on small-scale experimental data are not transferable to computational large-scale analyses, which makes the creation of rules for alternative splicing prediction based only upon DNA/RNA sequence an extraordinarily difficult task.
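The idea of comparing a motif density between real and randomised sequences can be sketched as follows. This is not the procedure used in the thesis: the density definition (matching windows divided by total windows), the shuffling-based randomisation, and the placeholder motif set are all assumptions made for illustration.

```python
import random


def motif_density(seq: str, motifs: set, k: int = 6) -> float:
    """Fraction of overlapping k-mer windows in seq that match a motif."""
    windows = len(seq) - k + 1
    if windows <= 0:
        return 0.0
    hits = sum(1 for i in range(windows) if seq[i:i + k] in motifs)
    return hits / windows


def shuffled(seq: str, seed: int = 0) -> str:
    """Randomised control: same nucleotide composition, shuffled order."""
    chars = list(seq)
    random.Random(seed).shuffle(chars)
    return "".join(chars)


# Placeholder hexamer set (hypothetical, not a curated enhancer list).
toy_motifs = {"GAAGAA"}
toy_seq = "GAAGAAGAAG"
real_density = motif_density(toy_seq, toy_motifs)
control_density = motif_density(shuffled(toy_seq), toy_motifs)
print(real_density, control_density)
```

Shuffling preserves nucleotide composition, so any difference in density between a sequence and its shuffled control reflects the arrangement of nucleotides rather than their overall frequencies.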
Based on our findings, we suggest that only 20% of the splicing information for SCE, and only 30% for SIR, is encoded at the sequence level.
In the last chapter we investigate whether alternative splicing may be connected to adaptive evolutionary processes in a species or population. Unfortunately, the currently available population genetic tools are not sensitive enough to identify traces of positive or balancing selection on the scale of a few hundred bp. Additional problems are incomplete SNP databases and SNP ascertainment bias. The evolutionary role of alternative splicing remains, at least for the moment, speculative.
Genetic Programming entries for Ivana Vukusic