CANSSI Ontario Research Day 2022

This is a free, hybrid event that welcomes attendees both online (via Zoom) and in person (at Chestnut Conference Centre, 89 Chestnut St, Toronto, ON).

Overview

The CANSSI Ontario Research Day showcases the work and discoveries of the data science and statistics community in Ontario. This one-day event draws participants from many Ontario universities and from the public, not-for-profit, and research sectors for a full day of data and discoveries.

2022 Theme

There is much debate among practitioners and scholars about what data science is. Join us on September 29, 2022, when you will hear some of the top experts in the field engage in informed, insightful discussions as they unravel the definitions of data science.

Registration

Registration is free.

Virtual webinar link:

https://us06web.zoom.us/j/81789056933?pwd=ODBTVkFseDFxbkJWZGZGVHR1bGEzUT09
Passcode: 122701

Lunch, as well as drinks and snacks during the morning and afternoon breaks, will be provided.

Seating is available on a first-come, first-served basis, subject to venue capacity.

Poster

2022 CANSSI Ontario Research Day Poster


Hourly Schedule

September 29, 2022

9:45 am - 10:00 am
Welcoming Remarks
Speakers:
Lisa Strug
10:00 am - 11:00 am
Some Statistical Issues in Population, Clinical and Laboratory COVID-19 Research
The COVID-19 pandemic has stimulated intensive interdisciplinary collaborations aiming to advance understanding of the population dynamics of infection, develop vaccines, and identify effective therapeutic interventions. This talk will describe three public health, clinical and laboratory research projects in COVID-19 research with an emphasis on the statistical challenges and methodology. The projects will be presented in the order they arose over the course of the pandemic. Specific topics include reporting delay adjustments to provincial and national infection rates, rapid design and cost-effective execution of a clinical trial assessing the therapeutic effect of convalescent plasma, and evaluating the vaccine response to individuals with autoimmune disease. Understanding the data acquisition and reporting processes is shown to be critically important, highlighting the need for careful planning as well as a good public health infrastructure for population research.
Speakers:
Richard Cook
11:00 am - 11:05 am
Break
11:05 am - 1:10 pm
Presentations by Ontario-based Researchers
Gengming He, Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto

Develop novel statistical methods for analyzing long-read sequencing data to investigate the genetic mechanism of cystic fibrosis

David Li, Department of Statistical Sciences, University of Toronto.

A Poisson Cluster Process with Cluster-Dependent Marking for Detecting Ultra-Diffuse Galaxies

Boxi Lin, Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto

Sex-stratified vs. sex-combined analysis in the presence of genetic effect heterogeneity.

YuChung Lin, Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto

Incorporating Functional Annotations in Polygenic Risk Scores Improves Generalizability to Cross-Ethnic Populations

Samar Salah Mohamedahmed Elsheikh, Centre for Addiction and Mental Health (CAMH), Pharmacogenetics Research Clinic

Genetic and Polygenic Risk Analysis of Antidepressant Response and Cognitive Domains in Late-Life Depression

Alina Selega, Lunenfeld-Tanenbaum Research Institute

Multi-objective Bayesian Optimization with Heuristic Objectives for Biomedical and Molecular Data Analysis Workflows

Divya Sharma, Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto

Hybrid CNN-LSTM model for disease prediction using longitudinal microbial data

Teresa Tsui, Sunnybrook Research Institute

Accounting for uncertainty in health utilities to inform cancer drug funding decisions

Nicholas Waglechner, Lunenfeld-Tanenbaum Research Institute, Sinai Health in Toronto

Genomic Epidemiology of Mycobacterium abscessus in a Canadian Cystic Fibrosis Centre

Changchang Xu, Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto

Penalized maximum likelihood inference of mixture cure model under multiple imputation

Jingxiong Xu, Lunenfeld-Tanenbaum Research Institute

A Novel Gene-Based Test for Sequencing Studies Based on a Bayesian Variable Selection of Rare Variants

Ziang Zhang, Department of Statistical Sciences, University of Toronto.

An indirect test of gene-environment interaction for binary trait

Lehang Zhong, Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto.

RoPE: a robust profile likelihood method for differential gene expression analysis

Junhao Zhu, Department of Statistical Sciences, University of Toronto.

LLOT in Reconstruction of Spatial Expression

Speakers:
Alina Selega, Boxi Lin, Changchang Xu, Dayi (David) Li, Divya Sharma, Gengming He, Jingxiong Xu, Junhao Zhu, Lehang Zhong, Samar Salah Mohamedahmed Elsheikh, Teresa Tsui, YuChung Lin, Ziang Zhang
1:10 - 2:00 pm
Poster Session | Lunch
2:00 - 3:15 pm
Panel Discussion: What is Data Science?
Moderated by:
Rohan Alexander, Assistant Director, CANSSI Ontario; Assistant Professor, Faculty of Information and Department of Statistical Sciences, University of Toronto

Panellists:
Marsha Chechik, Professor, Department of Computer Science, University of Toronto

Mark Daley, Chief Digital Officer, Professor, Department of Computer Science, University of Western Ontario

Donald Estep, Director, CANSSI; Professor, Department of Statistics and Actuarial Science, Simon Fraser University

Amber Simpson, Professor, School of Computing and Department of Biomedical and Molecular Sciences, Queen’s University
3:15 - 3:30 pm
Transition to the Distinguished Lecture Series in Statistical Sciences (DLSS) with Dr. Xihong Lin
Registration for the DLSS is separate and available here
Speakers:
Xihong Lin
Lisa Strug
Director, CANSSI Ontario | Professor, Department of Statistical Sciences, Department of Computer Science, and Biostatistics Division, Dalla Lana School of Public Health, University of Toronto
Richard Cook
University Professor and Mathematics Faculty Research Chair, Department of Statistics and Actuarial Science, Faculty of Mathematics, University of Waterloo
Richard Cook is a Professor in the Department of Statistics and Actuarial Science, Faculty of Mathematics Research Chair, and University Professor at the University of Waterloo. He holds a cross-appointment in the School of Public Health (UW) and a part-time appointment in the Faculty of Health Science at McMaster University. His research interests include the analysis of life history data, the design and analysis of clinical and epidemiological studies, and statistical methods for the analysis of incomplete data. He has published extensively in these areas and written two books with Jerry Lawless. He is also deeply engaged in collaborative research with other scientists working in transfusion medicine, immunology, and cancer, and consults widely with industry and government organizations. In 2018 he was awarded the Gold Medal of the Statistical Society of Canada and in 2021 he was named a Fellow of the Royal Society of Canada.
Alina Selega
CANSSI Ontario Top-up Awardee; Postdoctoral Fellow, Lunenfeld-Tanenbaum Research Institute
Multi-objective Bayesian Optimization with Heuristic Objectives for Biomedical and Molecular Data Analysis Workflows. Computational pipelines for the analysis of biomedical data often include many steps, each with many possible parameter settings that can affect downstream results. However, measures of pipeline success are often heuristic and therefore it is not known a priori which of the many objectives are useful. Thus, multi-objective Bayesian optimization (MOBO) methods that explore the set of possible solutions based on all objectives may be suboptimal in these cases. We developed a new MOBO method that guides optimization by selecting useful objectives based on how well they adhere to desirable criteria specified by the user. We apply our method in two real data scenarios: optimizing (i) a normalization parameter in a pipeline for imaging mass cytometry data and (ii) the percentage of highly variable genes in a pipeline for single-cell RNA-sequencing data.
Boxi Lin
CANSSI Ontario STAGE PhD Student, Division of Biostatistics, University of Toronto
Sex-stratified vs. sex-combined analysis in the presence of genetic effect heterogeneity. The effect of a genetic variant on a complex trait may differ between males and females; for example, genetic effects may be sex-specific for testosterone levels. In the presence of genetic effect heterogeneity between females and males, sex-stratified analysis is often used, which provides easy-to-interpret sex-specific effect size estimates. However, from the perspective of association-testing power, sex-stratified analysis may not be the best approach. Because a sex-specific genetic effect implies a SNP-sex interaction effect, jointly testing SNP main and SNP-sex interaction effects may be more powerful than sex-stratified analysis or the standard main-effect testing approach. When individual-level data are not available, it is of interest to study whether the interaction analysis can be derived from sex-stratified summary statistics. We considered several sex-combined methods and evaluated them through extensive simulation studies. We observed that (a) the joint SNP main and SNP-sex interaction analysis is the most robust across a wide range of genetic models, and (b) this joint interaction testing result can be obtained by quadratically combining sex-stratified summary statistics (i.e., squared sex-stratified summary statistics). We then provide theoretical justification for the equivalence between this joint interaction test and the quadratically combined omnibus test. Finally, we provide additional supporting evidence by applying the methods to a GWAS of testosterone levels in the UK Biobank data.
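As a rough illustration of the quadratic combination mentioned in this abstract, the sketch below adds the squares of two sex-stratified z-statistics to form an omnibus statistic. It assumes the female and male strata are independent and that each z-statistic is standard normal under the null, so the sum follows a chi-square distribution with 2 degrees of freedom. The function name and the example z-values are hypothetical and are not taken from the presentation.

```python
import math


def quadratic_omnibus_test(z_female: float, z_male: float) -> tuple[float, float]:
    """Combine two sex-stratified z-statistics quadratically.

    Assumes independent strata, so that under the null the sum of squares
    follows a chi-square distribution with 2 degrees of freedom.
    """
    statistic = z_female ** 2 + z_male ** 2
    # For 2 degrees of freedom the chi-square survival function is exp(-t / 2).
    p_value = math.exp(-statistic / 2.0)
    return statistic, p_value


# Hypothetical z-statistics with opposite signs in the two sexes.
stat, p = quadratic_omnibus_test(2.0, -1.5)
print(f"T = {stat:.2f}, p = {p:.4f}")
```

Because the statistic uses squared values, effects in opposite directions in the two sexes add up rather than cancel, which is the intuition behind its robustness to effect heterogeneity.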
Changchang Xu
CANSSI Ontario STAGE PhD Student; PhD Candidate, Division of Biostatistics, University of Toronto
Penalized maximum likelihood inference of mixture cure model under multiple imputation. This presentation consists of six sections.
I) Background and introduction:
  I-i) Mixture Cure (MC) model, specification and assumptions
  I-ii) Firth-type penalized likelihood (FT-PL) developed under MC framework
  I-iii) Motivating study cohort: an axillary lymph node negative (ANN) patient cohort data with missing values in tissue microarray molecular biomarkers; aim to evaluate molecular markers that may be indicative of differential prognosis of ANN breast cancer
  I-iv) Research focus: improving parameter estimation and inference for the MC model in finite/sparse samples with missing values in the covariates
II) Current research aim: developing profile likelihood-based inference under FT-PL, with incorporation into multiple imputation (MI)
III) Methods: description including
  III-i) Profile likelihood confidence interval (PLCI) and likelihood ratio test (LRT) developed for MC model under FT-PL
  III-ii) Basic procedure of MI by chained equation
  III-iii) Combined likelihood profile (CLIP) interval and profile likelihood-based LRT for MI
IV) Simulations to evaluate the performance of the proposed methods:
  IV-i) Simulation design and settings: distributions of biomarkers follow motivating study dataset; varying sample event recurrence rate (low to high); different assumptions of effect size/coefficients (null hypothesis/alternative hypothesis); different assumptions of missing data type (MCAR or MAR)
  IV-ii) Simulation analyses:
    IV-ii-1) Specification of a comprehensive imputation model; analysis model fitting under the data generating MC models
    IV-ii-2) Examining the validity of 1 df LRTs by type 1 error; evaluating the power of 1 df LRTs; comparing true value coverage between the proposed method (PLCI and CLIP) and the conventional method (Wald CI and Rubin's rule CI)
V) Application in case study data of ANN breast cancer cohort:
  V-i) Comparing the proposed PLCI vs. conventional Wald CI method in the complete case data (missing subjects dropped by listwise deletion) under FT-PL and ML
  V-ii) Comparing the proposed CLIP CI vs. conventional Rubin's rule CI with MI, under FT-PL and ML
VI) Discussion and ongoing work:
  VI-i) Whether/how different missing data type plays a role in the performance of proposed interval estimation methods
  VI-ii) Relevance of 2 df test and joint CI estimation for parameters associated with the same variable in MC model, and extending methods for bivariate parameter inference (work in progress)
Dayi (David) Li
CANSSI Ontario MDOC PhD Student, Department of Statistical Sciences, University of Toronto.
A Poisson Cluster Process with Cluster-Dependent Marking for Detecting Ultra-Diffuse Galaxies. Astrophysicists are interested in detecting a type of galaxy called ultra-diffuse galaxies (UDGs). Recently, a new method was proposed to detect UDGs by finding the clustering signals from their constituent globular clusters (GCs) using the log-Gaussian Cox process (LGCP). However, LGCP has several detection issues: it cannot determine the non-existence of UDGs, it can produce false-negative detections, and it performs poorly when UDGs reside in noisy environments. We propose a novel Poisson cluster process model framework to address these issues. We construct an improved spatial birth-death-move MCMC algorithm to conduct inference. We fit our models to real data and show that our models are indeed much superior to LGCP at UDG detection.
Divya Sharma
CANSSI Ontario STAGE Postdoctoral Fellow; Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto
Hybrid CNN-LSTM model for disease prediction using longitudinal microbial data. In this presentation, I will discuss the ability of longitudinal microbiome data to predict risk of disease. In this research, we aim to explore the complex interactions within the microbiome and between taxa and clinical factors (e.g., gender, age, ethnicity), and to capture temporal information to understand the dynamic changes in microbiomes using deep learning algorithms such as Long Short-Term Memory (LSTM) networks. We hypothesize that adding the information from past microbiome profiles will increase the predictive power of the microbiome data in comparison to training a model with microbiome data at each timepoint independently.
Gengming He
CANSSI Ontario STAGE PhD Student, Division of Biostatistics, University of Toronto
Develop novel statistical methods for analyzing long-read sequencing data to investigate the genetic mechanism of cystic fibrosis. Cystic fibrosis (CF) is a genetic disease caused by loss-of-function mutations in the cystic fibrosis transmembrane conductance regulator (CFTR) gene. Multiple organs can be affected by CF, causing CF-related comorbidities including diabetes, meconium ileus and lung disease. Genome-wide association studies (GWAS) have identified gene loci beyond CFTR that are associated with CF disease outcomes, referred to as CF disease modifier genes. Most GWAS of CF focus on individual single nucleotide polymorphisms (SNPs) but ignore complex structural variations (SVs) and the combined effect of multiple genetic variants, which are known to contribute to disease outcomes. Multiple cis-acting variants could have a distinct combined effect on CF disease outcomes when they are located on the same homologous chromosome (in cis), which cannot be detected by single-SNP analysis such as GWAS. SVs, including variable number tandem repeats (VNTRs), have been discovered in CF modifier gene loci and are in linkage disequilibrium (LD) with genome-wide significant SNPs. However, the phenotypic association of SVs across the genome has yet to be investigated. Phase information is needed to answer these questions, but it is not available in most studies because of the read-length limitation of current sequencing technologies. The goal of this study is to understand the complex genetic architecture of CF using recently developed long-read sequencing technology. We aim to recruit and sequence DNA from 1000 individuals with CF using long-read technologies, the majority with Pacific Biosciences (PacBio) sequencing, which is capable of generating phased haplotypes with the length of a whole gene. Novel statistical methods need to be tailored to analyze the long-read sequencing data. To detect cis-acting variants, we first introduce a cis-interaction matrix to code the phase information in a haplotype. Then a regularized matrix regression method is applied to select cis-acting variants. To test the genome-wide association of SVs, we propose a k-mer based regression approach without using a reference genome. A score statistic will be developed to test the association of SVs and SNPs while adjusting for call error, especially in VNTRs. Simulation studies demonstrated that these methods outperform the current approaches.
Jingxiong Xu
CANSSI Ontario STAGE Postdoctoral Fellow, Lunenfeld-Tanenbaum Research Institute
A Novel Gene-Based Test for Sequencing Studies Based on a Bayesian Variable Selection of Rare Variants. In human genetics, Next Generation Sequencing technology provides opportunities to discover rare variants associated with complex human diseases. A usual paradigm for detecting rare variants is to perform a gene-based association test. However, the inclusion of all rare variants in an association test might reduce its power, as the proportion of causal variants may be low. We propose a novel strategy that performs a variable selection of the rare variants to be considered in the gene-based test. Noninformative or annotation-based priors can be used.
Junhao Zhu
CANSSI Ontario MDOC PhD student, Department of Statistical Sciences, University of Toronto.
LLOT in Reconstruction of Spatial Expression. The spatial expression pattern of cells is vital for inferring the heterogeneity of cells’ fate in complex tissue and understanding the tissue function. Although new experimental approaches have been applied to sequencing RNA at single-cell resolution within the context of tissues, they provide limited resolution for expression, since only very few marker genes among thousands of genes are measured. Here, we introduce a method based on a linear model and Laplacian Optimal Transport to integrate spatial reference data and scRNA-seq data to study spatial courses of cells within a tissue. We apply the method to Drosophila scRNA-seq data and successfully reconstruct spatial gene-expression profiles in Drosophila early embryos. The results demonstrate the ability of our approach to provide a biologically interpretable framework for inferences and reconstructions about the spatial expression patterns of cells.
Lehang Zhong
CANSSI Ontario STAGE PhD Student, Division of Biostatistics, University of Toronto
RoPE: a robust profile likelihood method for differential gene expression analysis. RNA-Sequencing (RNA-Seq) technology produces data in integer read counts to measure gene expression, which inherently contain large technical or biological variation that creates modeling challenges. Most existing differential gene expression (DE) analysis toolkits focus on inference in small sample sizes and assume the over-dispersed data follow a negative binomial (NB) distribution. The NB assumption implies a quadratic mean-variance relationship, which may be too restrictive to represent the actual dispersion pattern genome-wide. Furthermore, several comparative studies have shown there is no one-size-fits-all analytic method. As sequencing technology decreases in price, RNA-seq experiments using larger sample sizes are becoming more accessible, and thus so are more robust tools that rely on large sample statistical properties. Here we develop RoPE, a novel method for DE analysis using the robust profile likelihood under both the frequentist and evidential paradigms that features: (1) accurate detection of truly differentially expressed genes in the presence of unknown forms of additional variation that remains in the read count; (2) extension of this framework to accommodate correlated cells or samples from single-cell technology. The main application is the DE analysis on patients with Cystic Fibrosis (CF) lung disease. CF affects multiple organs with complex genetic epidemiology that goes beyond the causal CFTR mutations, and the majority of morbidity and mortality in CF is due to lung disease. CF lung disease is due to cycles of bacterial infection and inflammation. Therefore, it is important to understand the impact on gene expression in the presence of infection, as these DE genes could be targets of therapeutics. Application of RoPE demonstrates that an active Pseudomonas aeruginosa infection downregulates the SLC9A3 Cystic Fibrosis modifier gene. The method we are developing has the potential for extension to a general regression framework.
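To make the quadratic mean-variance relationship mentioned in this abstract concrete, the sketch below evaluates the variance of a negative binomial count under the common mean-dispersion parameterization, Var = mean + dispersion * mean^2. It illustrates only that standard NB formula and is not part of RoPE; the function name and the dispersion value of 0.1 are hypothetical.

```python
def nb_variance(mean: float, dispersion: float) -> float:
    """Variance of a negative binomial count under the mean-dispersion
    parameterization: Var = mean + dispersion * mean**2."""
    if mean < 0 or dispersion < 0:
        raise ValueError("mean and dispersion must be non-negative")
    return mean + dispersion * mean ** 2


# Hypothetical dispersion of 0.1; a dispersion of 0 recovers the Poisson
# variance, which equals the mean.
for mu in (10.0, 100.0, 1000.0):
    print(mu, nb_variance(mu, 0.0), nb_variance(mu, 0.1))
```

The quadratic term dominates as the mean grows, which is why a single dispersion form can be restrictive when the actual variance pattern differs across genes.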
Samar Salah Mohamedahmed Elsheikh
CANSSI Ontario STAGE Postdoctoral Fellow, Centre for Addiction and Mental Health (CAMH), Pharmacogenetics Research Clinic
Genetic and Polygenic Risk Analysis of Antidepressant Response and Cognitive Domains in Late-Life Depression.
Teresa Tsui
CANSSI Ontario Top-up Award for Postdoctoral Fellows in Data Science, Sunnybrook Research Institute
Accounting for uncertainty in health utilities to inform cancer drug funding decisions. Decisions on the public reimbursement of drugs rely heavily on the incremental cost-effectiveness ratio of an economic evaluation, or the additional cost of a new treatment relative to the quality-adjusted life years (QALYs) gained. QALYs are equal to health utilities multiplied by time, a key measure of effectiveness in economic evaluations. An important source of uncertainty in QALYs is universally overlooked, which challenges the validity of the economic evaluation. Methods: We implemented novel methodology to account for uncertainty in estimating predicted mean health utilities measured with the EuroQol-5-dimension (EQ-5D-5L) questionnaire collected as part of the Canadian valuation study (n = 1,208 general public) and a study to develop the breast cancer utility instrument (n = 401, patients with breast cancer). Results: A multiply imputed dataset (n = 100) was created for the Canadian EQ-5D-5L time-trade-off (TTO) responses using a previously developed Bayesian model with spatial correlation. Applying this model correctly accounts for uncertainty in EQ-5D-5L, an improvement over the established Canadian TTO scoring model. Discussion and Conclusions: Our study provides a practical example and code which will enable researchers and decision makers to better account for uncertainty in estimating predicted mean health utilities. This will ultimately result in better allocation of scarce healthcare resources.
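The arithmetic behind the quantities named in this abstract can be sketched as follows: QALYs as health utility multiplied by time, and the incremental cost-effectiveness ratio as the additional cost per QALY gained. This is a minimal point calculation with hypothetical costs and utilities; it does not implement the Bayesian imputation model described in the presentation, and the function names are illustrative only.

```python
def qalys(utility: float, years: float) -> float:
    """Quality-adjusted life years: health utility multiplied by time."""
    return utility * years


def icer(cost_new: float, cost_old: float,
         qaly_new: float, qaly_old: float) -> float:
    """Incremental cost-effectiveness ratio: extra cost per QALY gained."""
    delta_qaly = qaly_new - qaly_old
    if delta_qaly == 0:
        raise ZeroDivisionError("treatments yield identical QALYs")
    return (cost_new - cost_old) / delta_qaly


# Hypothetical inputs: two years of follow-up under each treatment.
qaly_new = qalys(utility=0.8, years=2.0)
qaly_old = qalys(utility=0.6, years=2.0)
print(icer(cost_new=50_000.0, cost_old=30_000.0,
           qaly_new=qaly_new, qaly_old=qaly_old))
```

Because the utility enters the denominator, small changes in the assumed utilities can shift the ratio noticeably, which is one way to see why uncertainty in health utilities matters for funding decisions.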
YuChung Lin
CANSSI Ontario STAGE PhD Student, Division of Biostatistics, University of Toronto
Incorporating Robust Statistical Procedures into Model Selection Using Genomic Features with Small Effect Sizes. Polygenic risk scores (PRS) have achieved reasonable success in identifying genetic risk for complex diseases by incorporating contributions from thousands of genetic variants. However, existing PRS derived mainly from the European population show limited predictive capacity and can suffer from instability when generalizing to diverse populations. Polygenic transcriptome risk scores, which rely on predicted gene expression levels, have shown improved cross-ethnic portability but often suffer from unreliable predictions for gene expression in practice. Here we propose expression-based PRS (ePRS) that directly leverages publicly available eQTL summary statistics to selectively promote variants with significant eQTL evidence and penalize those without. Our method avoids a dichotomous classification of variants into eQTLs and non-eQTLs as in previous studies and instead relies on a continuous evidence measure (e.g., log of p-values) to adaptively adjust the estimated effect size for each variant. We demonstrate the robustness of ePRS through simulation and achieve greater predictive performance in the presence of covariate shift. Applying ePRS to the UK Biobank reveals improved generalizability for predicting lung functions in the Black British population compared to standard PRS methods.
Ziang Zhang
CANSSI Ontario STAGE PhD Student, Department of Statistical Sciences, University of Toronto
Interpretation and adjustment of covariate effects in the GWAS of binary traits. In genome-wide association studies (GWAS), it is desirable to include and test the interaction effect (GxE) between a single-nucleotide polymorphism (SNP, G) and an environmental variable (E), to achieve higher power for the detection of causal SNPs. However, accounting for this interaction effect through direct testing is infeasible in most studies, because the information on the environmental variable E is not available or is collected with a non-trivial amount of measurement error. On the other hand, the indirect testing method allows this interaction effect to be leveraged in the association testing without using the information of E. For quantitative traits (Y) that are approximately normally distributed, it has been shown that indirect testing of GxE interaction can be done by testing the heteroskedasticity of Y between genotypes. Therefore, when the traits are quantitative, screening SNPs with a strong signal of heteroskedasticity helps to identify potential causal SNPs that should be studied in a more detailed follow-up analysis. However, when traits are binary, the existing methodology based on testing for heteroskedasticity between genotypes cannot be generalized. By examining the extra dominance effect, we propose a novel methodology for indirectly testing or leveraging the GxE interaction effect for binary traits. Through simulation studies, we will show that joint testing with the proposed approach can better detect associated SNPs compared to the traditional additive testing method. We illustrate the use of the proposed method by applying it to the UK Biobank dataset for a complete GWAS.
Xihong Lin
Professor, Departments of Biostatistics and Statistics; Coordinating Director, Program in Quantitative Genomics, Harvard T.H. Chan School of Public Health, Harvard University

The event is finished.

Local Time

  • Timezone: America/New_York
  • Date: Sep 29 2022
  • Time: 9:45 am - 5:30 pm
Chestnut Conference Centre - University Room

Location

Chestnut Conference Centre - University Room
89 Chestnut St, Toronto, ON

Speakers

Esther Berzunza

Organizer

Esther Berzunza
Phone
416-689-7271
Email
esther.berzunza@utoronto.ca