Since the finishing of the Human Genome Project, many next-generation (NGS) or high-throughput sequencing platforms have emerged. One of the applications of NGS technology, variant discovery, can serve as a basis for precision medicine. Large sequencing projects are generating huge amounts of genetic variation data, which are stored in databases, either large central databases such as dbSNP, or gene- or disease-centered locus-specific databases (LSDBs). There are many variation databases with many different formats and varying quality. Apart from storage and analysis pipeline capacity problems, the interpretation of the variation is also an issue. Computational methods for predicting the effects of variants have been and are being developed, since experimental assessment of variation effects is often not feasible. Benchmark datasets are needed for the development and for performance assessment of such prediction methods.
We studied quality related aspects of variant databases and benchmark datasets. The online tool called VariOtator was developed to aid in the consistent use of the Variation Ontology, which was specifically developed to describe variation. Standardization is one aspect of database quality; the use of an ontology for variant annotation will contribute to the enhancement of it.
BTKbase is a locus-specific database containing information on variants in BTK, the gene involved in X-linked agammaglobulinemia (XLA), a primary immunodeficiency. If available, phenotypic data, i.e. the variant effects, are also provided. Statistics on variants and variation types showed that there is a wide spectrum of variants and variation types, and that the distribution of protein variants in the different BTK domains is not even.
The VariSNP database containing datasets with neutral (non-pathogenic) variants was generated by selecting variants from dbSNP and filtering for variants found in the ClinVar, PhenCode and SwissProt databases. Variants in these three databases are considered to be disease-related. The VariSNP database contains 13 datasets following the functional classification of dbSNP, and is updated on a regular basis.
To study the sensitivity to variation in different protein and disease groups, we predicted the pathogenicity of all possible single amino acid substitutions (SAASs) in all proteins in these groups, using the well-performing prediction method PON P2. Large differences in the proportions of harmful, benign and unknown variants were found, and distinctive patterns of SAAS types were found, both in the original and variant amino acids.
Representativeness is one quality aspect of variation benchmark datasets, and relates to the representation of the space of variants and their effects. We studied the coverage and distribution of protein features, including structure (CATH) and enzyme classification (EC), Pfam domains and Gene Ontology terms, in established benchmark datasets. None of the datasets is fully representative. Coverage of the features is in general better in the larger datasets, and better in the neutral datasets. At the higher levels of the CATH and EC classifications, all datasets were unbiased, but for the lower levels and other features, all datasets were biased.
- Department of Experimental Medical Science
- Vihinen, Mauno, Supervisor
- Höglund, Mattias, Supervisor
|Award date||2017 Sep 29|
|Place of Publication||Lund|
|Publication status||Published - 2017|
Place: Dora Jacobsohn, BMC D15, Klinikgatan 32, Lund
Name: Keller, Andreas
Affiliation: Saarland University, Saarbrücken
Lund University, Faculty of Medicine Doctoral Dissertation Series 2017:132
- Annotation, genetic variation, benchmarks, databases, disease groups, pathogenicity, proteins, representativeness, sensitivity, variant effect analysis