Methods in Integrative Genomics
Increasingly large-scale studies collect multiple different types of biological marker data ('omics' data) on the same set of people, enabling researchers to study different stages of disease in the same person. There is an increasing need for analysis methods capable of dealing with data from multiple platforms and biomarker types, which can account for the complex relations between molecular and other risk factors.
The Statistical Computing Section of the Royal Statistical Society and the Centre for Methodology at the London School of Hygiene and Tropical Medicine are holding a half-day workshop on Methods in Integrative Genomics on Tuesday 18th February.
This workshop aims to provide a forum for, and spark interest in, methods for analysing genomics and other molecular biology data in the context of epidemiological studies, in particular computational statistical and machine learning methods for large data sets.
Programme, speakers & abstracts
- 14.00 - 14.50: Manuela Zucknick (University of Oslo)
Multivariate structured Bayesian variable selection for treatment prediction in pharmacogenomic screens
Large-scale cancer pharmacogenomic screening experiments profile hundreds of cancer cell lines versus hundreds of clinically approved or experimental compounds. The aim of these in vitro studies is to use the genomic profiles of the cell lines together with information about the drugs to predict the response of individual cell lines to a particular drug or combination of drugs, and ultimately to learn about in vivo treatment response for patients.
This is a multi-task multi-view prediction problem, which can be addressed by high-dimensional multivariate regression, where the response variables are potentially highly correlated. We aim to improve prediction performance by combining the different genomic data sources efficiently in the input matrix, by borrowing information across drug response variables, and by using external knowledge about biological structures such as drug target pathways.
In this talk, Manuela will explore structured priors in multivariate Bayesian variable and covariance selection models to achieve this aim, in particular a Markov random field (MRF) prior for incorporating prior knowledge about the dependence structure, both between drug response variables and input variables from diverse genomic sources. This work is based on an efficient implementation of Bayesian seemingly unrelated regression by Banterle et al. (2018), where Markov chain Monte Carlo inference is made computationally feasible by factorisation of the covariance matrix amongst the response variables.
Banterle et al. (2018). Sparse variable and covariance selection for high-dimensional seemingly unrelated Bayesian regression.
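For readers new to this model class, the regression setting described in the abstract above can be sketched roughly as follows. This is a generic form of seemingly unrelated regression with a Markov random field prior on the variable inclusion indicators. The notation (Y, X, B, E, Σ, γ, G, d, e) is assumed here for illustration and is not taken from the talk itself.

```latex
% n cell lines, p genomic input variables, q drug response variables
\[
  \mathbf{Y} = \mathbf{X}\mathbf{B} + \mathbf{E}, \qquad
  \mathbf{E}_{i\cdot} \sim \mathcal{N}_q(\mathbf{0}, \boldsymbol{\Sigma}),
  \quad i = 1, \dots, n
\]

% gamma_{jk} = 1 if input variable j is selected for response k, else 0
\[
  \beta_{jk} = 0 \quad \text{whenever} \quad \gamma_{jk} = 0
\]

% MRF prior on the stacked indicators; G is the adjacency matrix of a
% graph encoding prior knowledge (e.g. shared drug target pathways)
\[
  p(\boldsymbol{\gamma} \mid d, e, \mathbf{G}) \propto
  \exp\!\left\{ d\,\mathbf{1}^{\top}\boldsymbol{\gamma}
  + e\,\boldsymbol{\gamma}^{\top}\mathbf{G}\,\boldsymbol{\gamma} \right\}
\]
```

Under these assumptions, d controls overall sparsity, while a positive e makes indicators that are linked in G more likely to be selected together. The correlation between drug responses enters through the off-diagonal entries of Σ.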
About the speaker
Manuela Zucknick is Associate Professor at the Oslo Centre for Biostatistics and Epidemiology, University of Oslo. She is interested in exploring statistical methods for integrating different sources of genomic data and incorporating biological knowledge via structured penalties and priors, e.g. for modelling and prediction of treatment effects in anti-cancer drug screens.
- 14.50 - 15.20: Ernest Diez Benavente (LSHTM)
A molecular barcode to inform the geographical origin and transmission dynamics of Plasmodium vivax malaria
Although Plasmodium vivax parasites are the predominant cause of malaria outside of sub-Saharan Africa, they are not always prioritised by elimination programmes. P. vivax is resilient and poses challenges through its ability to re-emerge from dormancy in the human liver. With growing drug resistance and increasing reports of life-threatening infections, new tools to inform elimination efforts are needed. To halt transmission, we need to better understand its dynamics, the movement of parasites, and the reservoirs of infection, so that targeted interventions can be designed. Molecular genetics and epidemiology have been applied successfully to track and study P. falciparum parasite populations, and here we sought to develop a molecular genetic tool for P. vivax. By assembling the largest set of P. vivax whole genome sequences (n=433), spanning 17 countries, and applying a machine learning approach, we created a 71-SNP barcode with high predictive ability to identify geographic origin (91.4%). Further, due to the inclusion of markers for within-population variability, the barcode may also distinguish local transmission networks. Using P. vivax data from a low-transmission setting in Malaysia, we demonstrate the potential ability to infer outbreak events. By characterising the barcoding SNP genotypes in P. vivax DNA sourced from UK travellers (n=132) to ten malaria-endemic countries, predominantly not used in the barcode construction, we correctly predicted the geographic region of infection origin.
About the speaker
Dr Ernest Diez Benavente is a bioinformatician and biostatistician with an interest in malaria genomics. He has worked on several projects looking at ways in which the genetic diversity of malaria parasites can be used to inform the dynamics of parasite transmission and provide insight into the appearance of drug resistance. His research also aims to understand how parasite genetic diversity affects antimalarial vaccine response and to inform vaccine development.
- 15.20 - 16.00: Coffee break
Venue: Pumphandle Bar
- 16.00 - 16.30: Ricard Argelaguet (European Bioinformatics Institute)
MOFA: a principled framework for the unsupervised integration of multi-omics data
The emergence of high-throughput technologies and the increasing availability of clinical data are radically changing the study of biology and its medical applications. In particular, the profiling of multi-omics from the same patient provides a unique opportunity to build statistical models to understand the molecular sources of patient heterogeneity. I will present Multi-Omics Factor Analysis (MOFA), a matrix factorisation framework for the comprehensive integration of multi-omics data. MOFA infers a set of latent factors which disentangle the sources of heterogeneity that are shared across multiple modalities from those specific to individual data modalities. The learnt factors enable a variety of downstream analyses, facilitating data interpretation and the construction of predictive models for clinical outcomes.
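For orientation, the factorisation described in the abstract above is commonly written in the form below. The notation is assumed here for illustration: Y^(m) is the data matrix for modality m (N samples by D_m features), Z holds the K latent factors shared by all modalities, and W^(m) holds the weights specific to modality m.

```latex
% M data modalities measured on the same N samples
\[
  \mathbf{Y}^{(m)} = \mathbf{Z}\,\mathbf{W}^{(m)\top} + \boldsymbol{\varepsilon}^{(m)},
  \qquad m = 1, \dots, M
\]
% Z: N x K factor matrix (one row per sample, shared across modalities)
% W^{(m)}: D_m x K weight matrix for modality m
% epsilon^{(m)}: residual noise for modality m
```

In this form, a factor captures shared heterogeneity when its weight column is active in several modalities, and modality-specific heterogeneity when it is active in only one. Sparsity-inducing priors on the weights are what encourage this separation.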
About the speaker
Ricard Argelaguet is a PhD student in Statistical Genomics at the European Bioinformatics Institute.
- 16.30 - 17.20: Paul Kirk (MRC Biostatistics Unit, Cambridge)
Integrative clustering approaches for multi-omics datasets
Using omics datasets to identify meaningful subgroups (whether of patients, genes, DNA motifs, or any other biological units) remains a key task in statistical omics and molecular medicine. The increasing availability of diverse omics datatypes presents challenges, as well as opportunities, for subgroup identification. Here we consider how Bayesian mixture modelling approaches can be used both for the identification of subgroups and the integration of multiple datatypes. We also discuss how we can assess (and attempt to maximise) the clinical relevance/meaningfulness of the clusters identified by mixture modelling.
About the speaker
Paul Kirk is a group leader within the MRC Biostatistics Unit and the Cambridge Institute of Therapeutic Immunology & Infectious Disease, and has previously held post-doctoral positions in Oxford, Imperial and Warwick. Having previously made contributions to the field of statistical systems biology, his current research is at the intersection of molecular precision medicine and statistical functional genomics. He is currently developing statistical and machine learning methods for the identification of clinically actionable disease subtypes.
Please note that this session will NOT be live-streamed/recorded.