library(tidyverse)
library(tidyrstats)
library(limma)
BIOL90042 Final Project Background
Background
Parkinson’s Disease
Parkinson’s Disease (PD) is a neurodegenerative condition featuring loss of dopaminergic neurons in the substantia nigra, at the base of the brain. Accumulation of misfolded alpha-synuclein protein within neurons precedes dopaminergic neuronal cell death. Disrupted mitochondrial function, increased oxidative stress and neuroinflammation also occur in the brains of people with PD.
Motor symptoms of PD include slowed movements, tremor, postural instability (loss of balance), unsteady gait and increased falls. Non-motor symptoms include depression, fatigue, cognitive impairment, psychotic symptoms and dementia.
The causes of PD are incompletely understood. There are many common genetic SNPs that are over-represented in PD patients, as identified via genome-wide association studies (GWAS). Aside from patients with a high burden of common genetic risk factors (aka ‘high polygenic risk’), around 10% of patients carry one of a handful of very rare SNPs which confer a very high risk of developing PD.
For many other patients, the cause of disease is ‘sporadic’ - that is, not due to genetics but potentially due to long-term exposure to environmental risks such as pesticides, heavy metals and/or air pollution.
PD motor symptoms can be managed in the early disease stage using the dopamine precursor L-DOPA. However, treatments that prevent neuronal cell death are still needed, and to date there is no cure for PD.
You can read more about PD in these resources:
https://myhealthcaretimes.blogspot.com/2025/05/parkinsons-disease-symptoms-treatment.html
Project Scenario
WEHI embarks on a multi-faceted research program to develop better treatments for PD. The program leverages technologies in which WEHI has world-leading expertise, including
population genetics,
patient-derived induced pluripotent stem-cell technology,
high-content drug screening, and
clinical trials.
Your role & audience persona
You are a team of biomedical researchers tasked with analysing the data generated by the Parkinson’s Disease research program, detailed below. You then present the research background, methods, findings and conclusions to your colleagues.
The audience are biomedically educated, but not expert in your research area.
Cohort overview
Population genetics
WEHI recruits a large cohort of thousands of Australian PD patients and healthy controls, and performs a GWAS. The summary statistics for each genome-wide significant locus in the GWAS are then published.
WEHI PD Research Cohort: 150 PD patients, and 150 controls from the large GWAS cohort, volunteer to donate blood for further experimental research and genotype analysis at WEHI. These people are also eligible for enrollment in a clinical trial.
For any individual, their PD polygenic risk score (PRS) can be calculated as a weighted sum of their risk genotypes, as follows. First, for each GWAS risk locus, their individual risk genotype dosage (0/1/2) is multiplied by the corresponding GWAS risk coefficient (aka beta) in the summary statistics. Next, the products for each locus are summed together to give the total genetic risk for that individual.
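The two steps above (multiply each dosage by its beta, then sum across loci) can be sketched in R as follows. This is a minimal illustration only: the locus names, betas and dosages are made-up toy values, and the object names (betas, dosage, prs) are assumptions, not the names used in the project data.

```r
# Toy GWAS effect sizes (betas) for three hypothetical risk loci
betas <- c(locus1 = 0.20, locus2 = 0.10, locus3 = 0.05)

# Toy risk-allele dosages (0/1/2): one row per person, one column per locus
dosage <- rbind(
  person_A = c(locus1 = 2, locus2 = 0, locus3 = 1),
  person_B = c(locus1 = 0, locus2 = 1, locus3 = 0)
)

# PRS = sum over loci of (dosage x beta).
# A matrix-vector product does the multiply-and-sum for every person at once;
# indexing betas by colnames(dosage) keeps loci in matching order.
prs <- drop(dosage %*% betas[colnames(dosage)])
prs
```

For person_A this works out as 2 x 0.20 + 0 x 0.10 + 1 x 0.05. With real data, the same idea applies once the dosage table and the summary statistics are aligned on a shared locus identifier.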
RNA-seq from iPSC-derived neurons
To investigate differentially expressed genes in PD neurons compared to controls, pluripotent stem cells are isolated from the blood of 8 PD patients and 8 healthy age- and sex-matched controls from the WEHI PD Research Cohort. The stem cells are treated with reagents that induce them to differentiate into dopaminergic neurons.
The newly created neurons are grown in culture under controlled laboratory conditions for two weeks. Their gene expression profiles are now determined almost entirely by genetics, with minimal influence from environmental confounders.
RNA is isolated from each cell culture, sequenced and mapped to the human genome. Counts for each gene for each donor are quantified and entered into a DGEList in R.
High-content drug screening
To identify chemical compounds that can protect neurons from cytotoxic stress, the WEHI National Drug Discovery Centre grows dopaminergic neurons from a healthy donor with low PD genetic risk, in a 1536-well plate. Compounds are then added to each well as follows:
192 test compounds in technical quadruplicate (i.e., each compound in 4 separate wells), with
192 controls in technical quadruplicate.
Each test compound has a matched control which has the same molecular weight and molecular constituents, but different chemistry with no known biological activity.
The human enzyme target and mechanism of action are provided for each active test compound.
Agonists increase enzyme function, whereas antagonists reduce enzyme function.
After adding compounds, all wells are treated with MPP, a chemical that causes oxidative stress and mitochondrial damage, mimicking PD disease biology.
After 24 hours of incubation, neuronal viability is measured in a luminescence assay. Cells that survive and grow despite MPP exposure will have the highest luminescence values, whereas those damaged by MPP will have lower luminescence values.
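One possible way to compare a test compound with its matched control is a two-sample t-test on the replicate wells, followed by multiple-testing correction across compounds. The sketch below uses made-up luminescence values for a single hypothetical compound; the object names and the placeholder p-values are assumptions for illustration, and other tests or models may suit the real data better.

```r
# Toy luminescence readings for ONE hypothetical compound and its matched
# control, four technical replicate wells each (values are made up)
test_wells    <- c(980, 1010, 1005, 995)
control_wells <- c(610, 590, 640, 600)

# Welch two-sample t-test, one-sided: is viability higher with the test compound?
res <- t.test(test_wells, control_wells, alternative = "greater")
res$p.value

# Many compounds are tested on one plate, so raw p-values should be adjusted.
# Here the first p-value comes from the test above; the rest are placeholders.
p_values <- c(res$p.value, 0.04, 0.20, 0.75)
p_adj <- p.adjust(p_values, method = "BH")  # Benjamini-Hochberg FDR
p_adj
```

With only four wells per group, each individual test has limited power, so plotting the replicate values alongside the adjusted p-values is worthwhile.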
Clinical trials
One of the hit compounds from the drug screen is already approved for medical treatment of other conditions by the Australian Therapeutic Goods Administration. Sixteen PD patients are enrolled in a clinical trial to test the effect of this drug on PD motor symptoms.
The patients are randomly assigned to L-DOPA (current standard of care), or the trial drug (potential new treatment). In addition, half of the patients in each treatment arm are randomly assigned to a supervised daily exercise routine.
All patients are given a wearable gyroscope patch worn on the chest throughout the trial. The gyroscope measures their vertical orientation (degrees off-axis) whilst walking throughout the day. Gyroscopes are not worn during the prescribed exercise routines.
A measurement of 0 in vertical orientation indicates perfectly upright.
This measurement, taken at multiple consecutive timepoints, is a proxy for balance and smoothness of motion whilst walking.
Note that healthy control subjects show little change in vertical orientation over time whilst walking.
The gyroscope data is collected, transmitted in real time, and processed by the WEHI Clinical Discovery Centre.
700 seconds of walking data is available for each patient.
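Because the orientation trace is a time series, each patient needs to be reduced to one or more summary values before groups can be compared. The sketch below computes a few candidate summaries on simulated data. The column names second and orientation, the one-reading-per-second layout, and the two toy subjects are assumptions for illustration; only donorid is named in the project data description.

```r
library(dplyr)

set.seed(42)
# Toy long-format data: two hypothetical subjects, 700 readings each
gait_toy <- data.frame(
  donorid     = rep(c("steady", "unsteady"), each = 700),
  second      = rep(1:700, times = 2),
  orientation = c(rnorm(700, mean = 0, sd = 1),   # small sway around upright
                  rnorm(700, mean = 0, sd = 5))   # large sway around upright
)

gait_summary <- gait_toy %>%
  group_by(donorid) %>%
  summarise(
    mean_tilt = mean(orientation),            # average lean off-axis
    sd_tilt   = sd(orientation),              # variability of the lean
    mean_step = mean(abs(diff(orientation)))  # average change between readings
  )
gait_summary
```

In this toy example the mean is similar for both subjects while the spread differs, which shows why it is worth comparing several summaries before settling on one.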
Data analysis tips
- Each student should create a new R project in a new directory named BIOL90042_final_group[id], replacing [id] with your unique group number.
- Create sub-directories, e.g. analysis/, data_processed/, charts/ and tables/
- Download the project data pack as BIOL90042_final_group[id]/data
Students can opt to focus on one dataset each at first, then work together to compile results and look for links between datasets.
Using a consistent project structure for all group members should allow analysis script files to be shared between team mates and to run on different machines without error.
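As a minimal sketch, the suggested sub-directories can be created from R itself, so every team member ends up with an identical layout. This assumes the code is run from the top level of your R project; the folder names are the ones suggested above.

```r
# Run from the top level of your R project (BIOL90042_final_group[id])
sub_dirs <- c("analysis", "data_processed", "charts", "tables")

for (d in sub_dirs) {
  # showWarnings = FALSE makes this safe to re-run if the folder already exists
  dir.create(d, showWarnings = FALSE)
}

# Build output paths relative to the project root, never as absolute paths,
# so that the same script runs unchanged on a team mate's machine
plot_path <- file.path("charts", "prs_plot_BA.pdf")
plot_path
```

The data/ folder is not created here because it comes from the downloaded data pack.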
To avoid file naming conflicts, we suggest naming scripts with your initials, e.g. analysis/explore_prs_BA.R
Similarly, append your initials to plots and data saved from your scripts e.g.
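ggsave('charts/prs_plot_BA.pdf')  # saves the most recently displayed ggplot
write_tsv(prs_summary, 'data_processed/prs_summary_BA.tsv')  # prs_summary is a placeholder name for your own data frame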
Presentation tips
Show first, then tell!
Communicating scientific findings requires convincing your audience that the data you have, and the analysis steps you have taken, support your conclusions. This is best achieved by presenting graphs that represent the underlying data, and then adding statistical testing results as appropriate. Reporting summary statistics alone (e.g. using tables) is rarely sufficient to convince your audience to accept your results.
Data exploration
Create a new .R script analysis/explore_data.R
Gene labels
anno <- readRDS('data/entrez_genename_key.Rds')
Gene name (symbols), Entrez gene ID number, and description.
Gene Ontology terms
gene_go <- read_rds('data/gene_ontology.Rds')
Human genes grouped into gene ontology ‘gene sets’. Gene ontology term ID (go_id), Term description, and Entrez gene ID numbers.
PRS data
load('data/prs.Rda')
Top hits from PD GWAS [gwas_top_hits]. The alternate allele (alt) is the allele that increases disease risk. The beta describes the unit increase in disease risk for each additional risk allele.
Case and control genotype data [ca_co_genotypes]. For each of 150 cases and 150 controls, this table details their alternate allele dosage at each significant PD GWAS locus.
RNAseq data
load('data/rnaseq.Rda')
dge$counts
dge$samples
dge$genes
A DGEList object [dge] containing counts, samples and genes information.
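One common route from a DGEList to a table of differentially expressed genes is the limma-voom workflow. The sketch below runs it on simulated counts, not the project data. It assumes the edgeR package is installed alongside limma (DGEList, filterByExpr and calcNormFactors come from edgeR), and the object names (dge_toy, group) and the two-group design are illustrative assumptions.

```r
library(edgeR)  # assumption: installed alongside limma
library(limma)

set.seed(1)
# Simulated counts: 200 genes x 16 samples (8 controls, then 8 cases)
counts <- matrix(rnbinom(200 * 16, mu = 100, size = 10), nrow = 200,
                 dimnames = list(paste0("gene", 1:200), paste0("s", 1:16)))
counts[1, 9:16] <- counts[1, 9:16] * 4   # plant one up-regulated gene in cases
group <- factor(rep(c("control", "case"), each = 8),
                levels = c("control", "case"))

dge_toy <- DGEList(counts = counts, group = group)

# Drop genes with too few reads, then compute TMM normalisation factors
keep    <- filterByExpr(dge_toy, group = group)
dge_toy <- dge_toy[keep, , keep.lib.sizes = FALSE]
dge_toy <- calcNormFactors(dge_toy)

# voom converts counts to log2-CPM with precision weights for linear modelling
design <- model.matrix(~ group)          # columns: (Intercept), groupcase
v      <- voom(dge_toy, design)
fit    <- eBayes(lmFit(v, design))

# Top genes for case vs control, ranked by evidence for differential expression
top <- topTable(fit, coef = "groupcase", number = 5)
top
```

With real data, covariates recorded in the samples table may also belong in the design matrix.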
Drug screen data
load('data/drugscreen.Rda')
Drug screening plate luminescence results [plate1].
Drug screening test and control compound locations [plate_meta].
Dictionary of compound known target enzymes and mechanisms of action [compound_dict].
Clinical trial data
load('data/gait.Rda')
[gait_data] contains information on the trial subject id (donorid), age, treatment arm, and exercise group, with 700 seconds of vertical orientation measurements taken whilst walking, from the wearable health-tech device.
Data exploration questions
What are the dimensions, column names, data structure and data types for each data set?
What are the distributions of the numeric data? What is the unique value count for the categorical data?
Where are data labels shared even though column names may vary?
What data reshaping commands will help to make the data sets amenable to tidyverse data wrangling, statistics and plotting?
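As one example of reshaping, data stored in a wide grid (such as plate readings laid out by row and column) can be converted to one row per observation with pivot_longer(). The sketch below uses a made-up 2 x 3 grid. Whether the project objects are matrices or data frames, and what their dimension names are, is for you to check; the names plate_toy, plate_row and plate_col are assumptions.

```r
library(tidyr)
library(tibble)

# Toy 2 x 3 grid of luminescence readings stored as a matrix
plate_toy <- matrix(c(10, 20, 30,
                      40, 50, 60),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(c("A", "B"), c("c1", "c2", "c3")))

# Inspect structure first: dimensions and data type
dim(plate_toy)
str(plate_toy)

# Move row names into a column, then gather the well columns into long format
plate_df   <- rownames_to_column(as.data.frame(plate_toy), "plate_row")
plate_long <- pivot_longer(plate_df, cols = -plate_row,
                           names_to = "plate_col", values_to = "luminescence")
plate_long
```

Once data is in long format, it can be joined to a metadata table on the shared row and column labels, and passed directly to ggplot2 or dplyr summaries.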
Research questions
These questions are to prompt your exploratory analysis.
Population genetics
What is the distribution of common polygenic risk between 150 cases and 150 controls?
Approximately what range of PRS scores is most useful for determining disease risk?
Approximately what range of PRS scores is not useful for determining disease risk?
RNA-seq from iPSC-derived neurons
Which genes are differentially expressed (DE) between PD cases and healthy controls?
Which gene ontology families do the DE genes belong to?
Is the clustering of samples by gene expression consistent with their disease status as cases and controls?
High-content drug screening
Which compounds are significantly associated with increased neuronal viability after MPP exposure?
What are the gene names for the enzyme targets of those compounds?
What is the mechanism of action of those compounds?
Clinical trials
Which summary statistic is best for capturing differences in gait smoothness and balance between subjects in the clinical trial?
Is there a significant difference in gait between the control (L-DOPA) and new drug treatment groups?
Is there an effect of exercise, and/or age on gait?
Dataset links
Aside from gene expression differences, what other demographic features distinguish the RNAseq cases from controls?
Are any of the gene targets of hit compounds from the drug screen differentially expressed in the RNAseq data?
Are there other demographic or genetic features that are associated with both gait and response to treatment in the clinical trial?