Data Science

with Dr Greg Gloor

Course Overview

MEDSCIEN 9506 is an advanced, theory-driven course focused on developing critical intuition for analyzing complex biomedical data, particularly high-throughput sequencing and other multivariate datasets. Emphasizing reproducible research practices, the course introduces students to R programming, R Markdown, and principled data management while building a strong conceptual foundation in Bayesian thinking, compositional data analysis, and error structures in biological data. Students will explore methods such as PCA, machine learning, Markov models, and generative AI, with an emphasis on understanding assumptions, limitations, and common analytical pitfalls. Through written assignments, an oral assessment, and independent learning, the course aims to strengthen students’ ability to critically evaluate data, identify flawed reasoning, and make informed analytical decisions in biomedical research.

Assignment Highlights

1. Assignment One

This assignment reinforced the importance of approaching biomedical literature with analytical skepticism rather than accepting reported associations at face value. Evaluating the microbiome–social anxiety study required me to think carefully about hypothesis plausibility, effect size, confounding, and incentive structures, all of which align closely with the core objectives of MEDSCIEN 9506. Rather than asking only whether an association existed, I was pushed to consider how strong it was likely to be, why it might appear, and under what conditions it would fail to replicate.

One of the most instructive aspects of the assignment was identifying unaddressed confounders, particularly SSRI use and psychiatric comorbidity. This exercise highlighted how easily biologically plausible narratives such as gut–brain interactions can overshadow alternative explanations when working with high-variance data like the microbiome. Anticipating a moderate effect size helped prevent overinterpretation of statistically significant but limited shifts in beta diversity and select taxa, reinforcing the idea that “detectable” does not necessarily mean “clinically meaningful.”

The conditional probability thought experiment was especially effective in strengthening my Bayesian intuition. Explicitly calculating positive predictive value across different prevalence scenarios demonstrated that the usefulness of a positive result depends far more on base rates than intuition often suggests, even when a test's sensitivity and specificity stay fixed. This reinforced why many proposed biomarkers or screening tools perform poorly when translated beyond controlled research settings.
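The base-rate effect described above can be sketched in a few lines of Python. This is a minimal illustration of Bayes' rule, not the assignment's actual calculation: the function name `positive_predictive_value`, the 90% sensitivity and specificity, and the three prevalence values are all assumptions chosen for the example.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Return P(condition | positive test) using Bayes' rule.

    All three arguments are probabilities between 0 and 1.
    """
    for value in (sensitivity, specificity, prevalence):
        if not 0.0 <= value <= 1.0:
            raise ValueError("all inputs must be probabilities in [0, 1]")

    # Fraction of the whole population that is affected AND tests positive.
    true_positives = sensitivity * prevalence
    # Fraction of the whole population that is unaffected but tests positive.
    false_positives = (1.0 - specificity) * (1.0 - prevalence)

    total_positives = true_positives + false_positives
    if total_positives == 0.0:
        raise ValueError("no positive results are possible with these inputs")
    return true_positives / total_positives


# Hypothetical test characteristics (assumed for illustration only).
SENSITIVITY = 0.90
SPECIFICITY = 0.90

if __name__ == "__main__":
    # Same test, three assumed prevalence scenarios.
    for prevalence in (0.01, 0.10, 0.50):
        ppv = positive_predictive_value(SENSITIVITY, SPECIFICITY, prevalence)
        print(f"prevalence={prevalence:.2f}  PPV={ppv:.3f}")
```

Under these assumed numbers, the same test gives a positive predictive value of roughly 8% at 1% prevalence, 50% at 10% prevalence, and 90% at 50% prevalence, which is the pattern the thought experiment was meant to expose.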

Overall, this assignment helped sharpen my ability to detect overconfidence, hidden assumptions, and exaggerated implications in scientific papers. It emphasized that rigorous analysis requires not only technical competence but also disciplined reasoning about uncertainty, incentives, and context, skills that are essential for interpreting complex biomedical data responsibly.

Overall Course Reflection

This course required me to actively engage with data science tools that challenged how I translate conceptual understanding into computational logic. Working with coding, data cleaning, and analysis forced me to slow down, troubleshoot systematically, and accept that fluency develops through iteration rather than immediate mastery. Instead of avoiding unfamiliar code, I practiced breaking problems into smaller steps, interpreting errors as feedback, and using documentation and examples to refine my approach. Through repeated exposure, I became more comfortable working with data programmatically and more confident navigating uncertainty in quantitative workflows. This experience reshaped how I approach technical problem-solving and strengthened my ability to integrate data science methods into interdisciplinary medical research.