2023 Workshop on Reproducibility

CANSSI Ontario and the Data Sciences Institute at the University of Toronto are excited to host the Toronto Workshop on Reproducibility in February 2023. This two-day workshop brings together academic and industry participants to address the critical issue of reproducibility in applied statistics and related areas.


Hourly Schedule

Wednesday, February 22, 2023.

08:30 - 17:15
Toronto Replication Games.
Participants will be matched with other researchers working in the same field (e.g., economics, American politics). Each team will work on replicating a recently published study in a leading economics or political science journal.

Interested researchers and teams should contact Abel Brodeur (abrodeur@uottawa.ca).

Thursday, February 23, 2023.

08:50 - 17:15
Workshop on Reproducibility.
This hybrid workshop is free and open to all.

The Workshop has three broad focus areas:

  1. Evaluating reproducibility: Systematically examining the extent of reproducibility of a paper, or even of a whole field, is important for understanding where weaknesses exist. Does, say, economics fall flat while demography shines? How should we approach these reproductions? What aspects contribute to the extent of reproducibility?

  2. Practices of reproducibility: We need new tools and approaches that encourage us to think more deeply about reproducibility and integrate it into everyday practice.

  3. Teaching reproducibility: While it is probably too late for most of us, how can we ensure that today’s students don’t repeat our mistakes? What are some case studies that show promise?
08:50 - 09:00
Opening Remarks.
Rohan Alexander, University of Toronto.
09:00 - 09:15
Reproducible Teaching in Statistics and Data Science Curricula.
Mine Dogucu, University College London & University of California Irvine.

Teaching reproducibility.

Abstract: In reproducibility, we often focus on 1) reproducible research practices and 2) teaching these practices to students. In this talk, I will talk about a third dimension of reproducibility: reproducible teaching. Instructors use tools and adopt practices in preparing their teaching materials. I will discuss how reproducibility relates to these tools and practices. I will share examples from my statistics and data science courses and make recommendations based on teaching experiences.
09:15 - 09:30
Reproducible Student Project Reports with Python + Quarto.
Debbie Yuster, Ramapo College of New Jersey.

Teaching reproducibility.

Abstract: R users have long enjoyed the ability to render professional-looking documents using R Markdown. Output formats include reports, blog posts, presentation slides, books, and more. These documents can contain a mixture of narrative, code, and code output, so they are ideally suited to reproducible work. Results and figures can be generated upon rendering, greatly reducing the risk of copy/paste errors and outdated results. The benefits of R Markdown are now available to users of Python and Julia in the form of Quarto, an R Markdown successor. Since Fall 2022, I have required my Data Science students to create their project reports using Python + Quarto. In this talk, I’ll introduce Quarto and some of its features, and report on my students’ experience learning and using it.
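
As a concrete, hedged sketch (not material from the talk): a minimal Quarto source file mixing narrative with an executable Python cell might look as follows; the title, seed, and variable names are hypothetical, and rendering assumes Quarto, numpy, and matplotlib are installed.

    ---
    title: "Project Report"
    format: html
    jupyter: python3
    ---

    ## Results

    The figure below is regenerated on every render, so it cannot
    drift out of sync with the code or data.

    ```{python}
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(seed=853)            # fixed seed for reproducibility
    scores = rng.normal(loc=70, scale=10, size=200)  # hypothetical student scores
    plt.hist(scores, bins=20)
    plt.xlabel("Score")
    plt.show()
    ```

Saving this as, say, report.qmd and running quarto render report.qmd rebuilds the figure from code on every render, so copied-and-pasted results never go stale.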
09:30 - 09:45
Moon and suicide: a case study in debunking a likely false-positive finding.
Martin Plöderl, Paracelsus Medical University, Salzburg, Austria.

Practices of reproducibility.

Abstract: In my presentation, I will summarize the process of replicating a surprising finding by researchers who reported a statistically significant increase in suicide rates during the full moon in northern Finland, but only among younger women, and only in winter. We failed to replicate this finding with much larger samples from the Austrian and Swedish suicide registers. The finding from Finland was likely a false positive. I will discuss problematic research and publication practices related to these findings.
09:45 - 10:00
Audience Q&A and/or break.
10:00 - 10:15
Code execution during peer review - you can do it, too!
Daniel Nüst, CODECHECK & Reproducible AGILE | TU Dresden.

Evaluating reproducibility.

Bio: Daniel is a research software engineer and postdoc with the Chair of Geoinformatics, TU Dresden, Germany. He develops tools for open and reproducible geoscientific research and is a proponent of open scholarship and reproducibility in the projects NFDI4Earth, o2r, and OPTIMETA and in the CODECHECK initiative.
10:15 - 10:30
Towards greater standardization of reproducibility: the TrovBase approach.
Sam Jordan, TrovBase.

Practices of reproducibility.

Abstract: Research code is difficult to understand and build upon because it isn’t standardized; research pipelines are artisanal. TrovBase is a data management platform that standardizes the process from dataset configuration to analysis, and does so in a way that makes sharing analysis (and building upon it) easy and fast. The TrovBase team will discuss how to make graphs and analysis maximally reproducible using TrovBase.
10:30 - 10:45
Sharing the Recipe: Reproducibility and Replicability in Research Across Disciplines.
Rima-Maria Rahal, Max Planck Institute for Research on Collective Goods.

Practices of reproducibility.

Abstract: The open and transparent documentation of scientific processes has been established as a core antecedent of free knowledge. This also holds for generating robust insights in the scope of research projects. To convince academic peers and the public, the research process must be understandable and retraceable (reproducible), and repeatable (replicable) by others, precluding the inclusion of fluke findings into the canon of insights. In this contribution, we outline what reproducibility and replicability (R&R) could mean in different disciplines and traditions of research, and what significance R&R has for generating insights in these fields. We draw on projects conducted within the Wikimedia "Open Science Fellows Program" (Fellowship Freies Wissen), an interdisciplinary, long-running funding scheme for projects contributing to open research practices. We identify twelve implemented projects from different disciplines which primarily focused on R&R, and multiple additional projects also touching on R&R. From these projects, we identify patterns and synthesize them into a roadmap of how research projects can achieve R&R across different disciplines. We further outline the ground covered by these projects and propose ways forward.
10:45 - 11:00
Audience Q&A and/or break.
11:00 - 11:15
Certifying reproducibility.
Lars Vilhuber, Cornell University.

Evaluating reproducibility.

Abstract: One of the goals of reproducibility - the basis for all subsequent inquiries - is to assure users of a research compendium that it is complete. How do we do that? We re-run code. But what if the data underlying the compendium is confidential (sensitive)? What if it is transient (Twitter)? What if it is so big that it takes weeks to run? All of the above? I will talk about efforts in designing a way to credibly convey that the compendium has run at least once, and the many questions that might arise around that.
11:15 - 11:30
Accessible reproducibility for biological researchers.
Claudia Solis-Lemus, University of Wisconsin-Madison.

Practices of reproducibility.

Abstract: Reproducibility is challenging for everyone, but for biological researchers who have not been trained in good computing practices, maintaining a reproducible practice might appear impossible at first glance. We will go over specific strategies for researchers who do not come from computational backgrounds.
11:30 - 11:45
A Computational Reproducibility Investigation of the Open Data Badge Policy in one Issue of Psychological Science.
Sophia Crüwell, University of Cambridge / Charité Medical University Berlin.

Evaluating reproducibility.

Abstract: I will present a study that looked at the Open Data badge policy at the journal Psychological Science. We attempted to reproduce 14 articles (at least 3 independent reproduction attempts each) that received the Open Data badge, and found that only 1 was exactly reproducible and 3 further articles were essentially reproducible. I will discuss our results and recommendations for the implementation of Open Data badges as incentives for increasing reproducibility and transparency.
11:45 - 12:00
Audience Q&A and/or break.
12:00 - 12:15
How to meld open science and reproducibility today to live on Mars tomorrow.
Rob Reynolds, KBR / NASA.

Teaching reproducibility.

Bio: Rob is a Data Scientist with NASA's Johnson Space Center in Houston, TX. Originally trained as an epidemiologist and biostatistician, he helps NASA formalize the process of explaining and quantifying the risks to humans from spaceflight.
12:15 - 12:30
A common pipeline for curating electronic health records data to enhance reproducibility of real-world evidence studies.
Jue Hou, University of Minnesota & Jesse Gronsbell, University of Toronto.

Teaching reproducibility.

Abstract: Electronic health records (EHRs) are becoming a central source of data for biomedical research and have potential to improve our understanding of healthcare delivery and disease processes. However, the analysis of EHR data remains both practically and methodologically challenging as it is recorded as a byproduct of clinical care and not generated for research purposes. In this talk, we will describe the reproducibility challenge in EHR-based research and introduce our ongoing work developing a pipeline for real-world evidence with EHRs.
12:30 - 12:45
Audience Q&A and/or lunch.
12:45 - 13:00
Lunch break.
13:00 - 13:15
Reproducible Open Science for All.
Yanina Bellini Saibene, rOpenSci.

Practices of reproducibility.

Abstract: Open Source and Open Science are global movements, but there is a dismaying lack of diversity in these communities. Non-English speakers and researchers working from the Global South face a significant barrier to being part of these movements. rOpenSci is carrying out a series of activities and projects to ensure our research software serves everyone in our communities, which means it needs to be sustainable, open, and built by and for all groups.
13:15 - 13:30
A reproducible workflow and software tools for working with the Global Extreme Sea Level Analysis (GESLA) dataset.
Fernando Mayer, Maynooth University.

Practices of reproducibility.

Abstract: In this talk, we will present a general reproducible workflow in the context of the project "Estimating sea levels and sea-level extremes for Ireland". We will demonstrate a set of software tools used to work with a large, worldwide sea level dataset called GESLA (Global Extreme Sea Level Analysis). This workflow and set of tools can hopefully help other researchers adopt reproducible practices.
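
The talk will demonstrate the authors' own software tools. As a rough, hedged sketch of the kind of workflow involved (not the authors' code), the Python snippet below reads a hypothetical GESLA-style station file and computes annual maxima; the assumed file layout and flag coding are illustrative assumptions, not the actual GESLA specification.

    import pandas as pd

    def read_gesla_station(path):
        """Read one GESLA-style station file into a tidy DataFrame.

        Assumes (as an illustration) a plain-text layout: '#'-prefixed
        metadata header lines, then whitespace-separated date, time,
        sea-level, and flag columns. The real format may differ.
        """
        df = pd.read_csv(
            path,
            comment="#",   # skip the metadata header block
            sep=r"\s+",
            header=None,
            names=["date", "time", "sea_level", "qc_flag", "use_flag"],
        )
        df["datetime"] = pd.to_datetime(df["date"] + " " + df["time"])
        # Keep only observations flagged as usable under the assumed coding
        return df.loc[df["use_flag"] == 1, ["datetime", "sea_level"]]

    # Annual maxima are a common starting point for sea-level extremes
    station = read_gesla_station("station_file.txt")  # hypothetical file name
    annual_max = station.groupby(station["datetime"].dt.year)["sea_level"].max()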
13:30 - 13:45
Qualitative Transparency Tools and Practice in Sexual and Reproductive Health Research.
Marielle Kirstein, Guttmacher Institute.

Practices of reproducibility.

Abstract: Reproducibility is fundamental to the open science movement to ensure science is transparent and accessible, but much of the work on reproducibility has come from quantitative research and data. However, the principles of transparency are equally relevant to qualitative researchers despite some unique challenges implementing transparent practices, given the nature of qualitative data. In this presentation, we will introduce the principles and concepts that underpin qualitative transparency and describe how we at the Guttmacher Institute have been developing and implementing qualitative transparency practices in our work. Guttmacher conducts policy-relevant research on sexual and reproductive health, and our qualitative data often includes sensitive content, underlining the ethical imperative to protect participant confidentiality. We will describe how we have embedded transparency into our qualitative projects through the use of transparency launch meetings and checklists, among other practices, and we will highlight previous and current projects at Guttmacher that are making some aspects of their projects publicly available.
13:45 - 14:00
Audience Q&A and/or break.
14:00 - 14:15
Evaluating the Reproducibility and Reusability of Transfer Drug Response Workflows.
Grace Yu, University of Toronto & Emily So, University of Toronto.

Evaluating reproducibility.

Abstract: With recent advances in molecular profiling and computational technologies, there has been growing interest in developing and using machine learning (ML) and artificial intelligence (AI) techniques in personalized medicine and precision oncology. An active area of research in this domain is focused on the development of computational models capable of predicting therapy response for cancer patients. In a recent publication in Nature Cancer, Ma et al. presented a novel approach, named "Transfer of Cell Line Response Prediction" (TCRP), which utilizes few-shot learning to transfer drug response prediction from immortalized cancer cell line data to more complex in vitro patient-derived cell cultures and in vivo patient-derived xenografts. The authors demonstrated the effectiveness of their method in enabling the development of computational models that can accurately predict drug response in various contexts. Given the impressive results, we aim to address two main issues: (1) validating the performance of the TCRP model in its published context (reproducibility) and (2) extending its applicability to a broader range of preclinical pharmacogenomic and clinical trial data (reusability). The deployment of models such as TCRP will significantly contribute to improving personalized medicine by facilitating the selection of optimal therapy for individual patients based on their molecular profile.
14:15 - 14:30
Reproducibility and Dataset Construction: Digitizing the Australian Hansard.
Lindsay Katz, University of Toronto.

Practices of reproducibility.

Abstract: While approaches to reproducibility in code are well-established, there is less focus on reproducibility in the context of datasets. In this talk, I will introduce an approach to enhancing the reproducibility of dataset construction through automated data testing. My joint work with Dr. Rohan Alexander on digitizing the Australian Hansard will be discussed as a case study, with specific examples of data validation and reproducible practices from our work.
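
As a hedged illustration of automated data testing (not the authors' actual test suite), the Python sketch below runs structural, validity, and completeness checks on a hypothetical table of parliamentary speeches; the column names and rules are invented for illustration.

    import pandas as pd

    def validate_hansard(df: pd.DataFrame) -> None:
        """Run automated checks on a constructed dataset; fail loudly on error.

        The columns and rules are hypothetical examples of the kind of
        validation described in the talk, not the authors' tests.
        """
        # Structural check: expected columns are present
        assert {"date", "speaker", "text"} <= set(df.columns)
        # Validity check: dates parse and fall in a plausible range
        dates = pd.to_datetime(df["date"], errors="raise")
        assert dates.between("1901-01-01", "2023-12-31").all()
        # Completeness checks: no missing speakers, no empty speeches
        assert df["speaker"].notna().all()
        assert df["text"].str.strip().str.len().gt(0).all()

    # Example: a one-row table that passes every check
    validate_hansard(pd.DataFrame({
        "date": ["1901-05-09"],
        "speaker": ["Edmund Barton"],
        "text": ["Mr Speaker, ..."],
    }))

Running such checks automatically whenever the dataset is rebuilt turns silent construction errors into immediate, visible failures.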
14:30 - 14:45
Audience Q&A and/or afternoon tea.
14:45 - 15:00
Afternoon tea.
15:00 - 15:15
TBC.
15:15 - 15:30
Bridging the gap between data availability and reproducibility.
Aya Mitani, University of Toronto.

Evaluating reproducibility.

Abstract: Journals claim that one of the reasons authors are required to make data available is to facilitate reproducibility in research. However, most of the time, when data are publicly available, they are not in the right format to perform the analysis. In this talk, I share the challenges I have experienced in trying to reproduce the results of some published papers using data sources presented in their data availability statements. I also suggest some ways to improve the reproducibility pipeline, especially when the analytical data set is created from multiple data sources.
15:30 - 15:45
Errors of Interpretation.
Nick Radcliffe, Stochastic Solutions.

Practices of reproducibility.

Abstract: "If our results are to have any useful impact in the world, they not only have to be (broadly) correct they also have to be interpreted correctly, and it’s our responsibility, as data scientists, to maximize the chances that this will be the case. What can we do to increase the likelihood of correct interpretations, and can software help?".
15:45 - 16:15
Audience Q&A and/or break.
16:15 - 16:30
Git is my lab book: "baking in" reproducibility.
Rob Moss, The University of Melbourne.

Teaching reproducibility.

Abstract: I am part of an infectious diseases modelling group that has informed Australia's national pandemic preparedness and response plans for the past ~15 years. In collaboration with public health colleagues since 2015, we have developed and deployed near-real-time seasonal influenza forecasts. We rapidly adapted these methods to COVID-19 and, since April 2020, near-real-time COVID-19 forecasts have informed public health responses in Australian states and territories. Ensuring that our results are valid and reproducible is a key aspect of our research. We are also part of a broader consortium whose remit includes building sustainable quantitative research capacity in the Asia-Pacific region. In this talk I will discuss how we are trying to embed reproducible research practices into our EMCR cohort, beginning with version control workflows and normalising peer code review as an integral part of academic research.
16:30 - 16:45
The consequences of Excel autocorrection on genomic data.
Mandhri Abeysooriya, Deakin University.

Practices of reproducibility.

Abstract: Erroneous conversion of gene names into other types of data, such as dates and numbers, has been a long-standing issue in computational biology, with significant consequences for data reproducibility. Although the problem was first identified in 2004 and was studied extensively in 2016 and 2021, it continues to occur. We have observed that gene names can be converted not only to dates and floating-point numbers, but also to an internal date format of five-digit numbers. These misinterpretations introduce inaccuracies and inconsistencies that make results difficult to reproduce, highlighting the limitations of spreadsheets for managing and analyzing large genomics data. To support the progress of science and technology in the field, it is crucial to use appropriate software tools, or alternative methods, for handling large genomics datasets.
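
As a hedged illustration of how such conversions can be caught before analysis (not a tool from the talk), the Python sketch below scans a gene-symbol column for values that look like day-month dates (e.g., "1-Sep", which Excel can produce from SEPT1) or five-digit Excel date serials; the file and column names are hypothetical.

    import pandas as pd

    # Patterns that suggest a gene symbol was autocorrected by a spreadsheet:
    DATE_LIKE = r"^\d{1,2}-(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$"
    SERIAL_DATE = r"^\d{5}$"  # Excel's internal five-digit date serial

    def flag_mangled_symbols(series: pd.Series) -> pd.Series:
        """Return entries of a gene-symbol column that look date-converted."""
        s = series.astype(str).str.strip()
        suspicious = s.str.match(DATE_LIKE) | s.str.match(SERIAL_DATE)
        return series[suspicious]

    # Hypothetical file and column names, for illustration only
    genes = pd.read_csv("expression_table.csv")["gene_symbol"]
    print(flag_mangled_symbols(genes))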
16:45 - 17:00
Audience Q&A and/or break.
Rob Moss
Senior Research Fellow
Dr Rob Moss is a Senior Research Fellow in the Infectious Disease Dynamics Unit at the Melbourne School of Population and Global Health, The University of Melbourne. He is a postdoctoral member of the SPECTRUM NHMRC CRE, and collaborator in the public health pillar of the APPRISE NHMRC CRE. He was awarded an APPRISE Research Fellowship for 2020-2021. His research focus is on the development and application of high-performance computational modelling methods for predicting and mitigating the burden of seasonal and pandemic influenza. This includes the use of scenario modelling to inform recommendations for specific interventions, such as targeted antiviral distribution, and synthesising disease models and surveillance data to provide near-real-time epidemic forecasts. He collaborates with public health staff across Australia, and is also a member of the WHO Influenza Incidence Analytics Group (IIAG), where he leads the Australian and regional discussions. In addition to these research interests, he is actively interested in broader issues related to model-driven science, including the dissemination of models (including source code, parameter sets, analysis scripts, etc), and effective communication of research outputs through the use of visualisations.

Organizer

Rohan Alexander
https://rohanalexander.com/