Part 1

Setup

library(tidyverse)
library(readxl)
library(broom)
library(ggridges)

library(tidyrstats)

theme_set(theme_minimal())

Download data & Quarto files

Save into your data/ directory.

To run this workshop in quarto on your local machine:

then unzip, and copy the files into your R project home directory e.g. BIOL90042_R_Course.

Read data

tumour_data <- readxl::read_xlsx('data/tumour_data.xlsx')

tumour_data

# A tibble: 300 × 42
   donorid status    cell_size cell_density nuclear_area membrane_curvature
   <chr>   <chr>         <dbl>        <dbl>        <dbl>              <dbl>
 1 d001    malignant     0.170        0.654        0.877              2.50 
 2 d002    malignant     2.95         1.42         0.677              4.78 
 3 d003    malignant     4.79         4.66         0.593              6.96 
 4 d004    benign        4.67         1.74         0.423              0.928
 5 d005    malignant    13.0          0.887        0.489              3.01 
 6 d006    benign        3.78         1.02         0.486              0.580
 7 d007    malignant     0.663        8.53         1.03               2.17 
 8 d008    benign        0.950        1.09         0.581              5.10 
 9 d009    malignant     8.39         3.64         0.955              8.78 
10 d010    benign        9.82         7.39         0.403             15.3  
# ℹ 290 more rows
# ℹ 36 more variables: granularity <dbl>, border_diffuseness <dbl>,
#   vascular_density <dbl>, necrosis_extent <dbl>, stromal_thickness <dbl>,
#   mitotic_count <dbl>, texture_1 <dbl>, texture_2 <dbl>, texture_3 <dbl>,
#   texture_4 <dbl>, texture_5 <dbl>, texture_6 <dbl>, texture_7 <dbl>,
#   texture_8 <dbl>, texture_9 <dbl>, texture_10 <dbl>, intensity_1 <dbl>,
#   intensity_2 <dbl>, intensity_3 <dbl>, intensity_4 <dbl>, …

The data contains 300 tissue samples which have been imaged in a histology lab, as referenced in this chapter Background. Each sample has the anonymized donor id and malignancy status determined by a trained pathologist (benign or malignant).

Automatic quantification of image features produces 40 ‘features’ including cell shapes, edge (plasma membrane) definition, cytosolic texture, staining intensity, necrosis, vascularity etc.

Explore the data

Reshape for plotting

dat_long <- tumour_data %>% 
  pivot_longer(names_to = "key", values_to = "value", 
               cols = -c(donorid, status))

Check distributions

dat_long %>%   ggplot(aes(x=value,group=key)) + geom_density()

The data doesnt look normally distributed, but there is a big over-plotting problem here. Let’s use the ggridges package to produce a single density plot for each feature:

dat_long %>%  ggplot(aes(x=value)) + 
  geom_density_ridges(aes(y=key))

Picking joint bandwidth of 0.646

The measurements in the dataset have a positive skew, meaning they are non-normal. For best results in machine learning (and data analysis more generally) its usually best to transform the data into a normal distribution where possible.

Log transform

dat_long_log <- dat_long %>%
  mutate(value=log10(value))

Plot transformed data

dat_long_log %>%  ggplot(aes(x=value)) + 
  geom_density_ridges(aes(y=key))

Picking joint bandwidth of 0.0993

The data looks more normal now. Note that the mean values are different. This can be addressed by scaling data, which we return to later.

Cluster analysis

Let’s run an unsupervised machine learning method to explore how the samples group together using a ‘data-driven’ approach. pr_comp() gives us principal components analysis, and requires wide-format data.

Reshape log-transformed data to wide format, and calculate principal components using the quantitative features (i.e., everything except donorid and malignancy status).

dat_wide_log <- dat_long_log %>% 
  pivot_wider(names_from = "key", values_from="value") 
  
pc_dat <- dat_wide_log %>% select(-c(donorid,status)) %>% 
        #principal components function
  prcomp()

We can access the principal components for each sample by using the broom::augment() function. The primary PCs (e.g. 1 thru 5) encode the majority of the variation between samples. As such we focus on primary PCs (or dimension in MDS), when clustering samples

Plot first 2 PCs, coloured by tumour status

pcs_fitted <- pc_dat %>% augment() 
  
#Join the sample labels to the fitted PCs
pc_plot_dat <- bind_cols(tumour_data %>% select(status), 
                         pcs_fitted) 

#Scatter plot
pc_plot_dat %>% 
  ggplot(aes(.fittedPC1, .fittedPC2)) + 
  geom_point(aes(col=status), size=2)

There is certainly not a clear separation between malignant and benign. This indicates we will need to try a more sensitive/advanced method to reliably predict the tissue status.

Linear model

Lets try one of the simplest ML engines - the linear model, to see whether any of the imaging features are significantly associated with tissue status.

If we define a model where imaging feature is the outcome (y) and tumour status is the predictor (x) we can test for associations quite easily, using lm_test().

Our experimental question is ‘Are any features (y) significantly different between benign (x = 0 ) and malignant (x=1)?’

#marginal effects
dat_long_log %>% group_by(key) %>% 
  lm_test( value ~  status  ) %>% filter(term!='intercept')

Results for linear model: value ~ status

# A tibble: 40 × 6
   key                term            estimate std.error statistic       p.value
   <chr>              <chr>              <dbl>     <dbl>     <dbl>         <dbl>
 1 granularity        statusmalignant   0.280     0.0446      6.28 0.00000000121
 2 intensity_8        statusmalignant   0.146     0.0240      6.10 0.00000000325
 3 membrane_curvature statusmalignant   0.228     0.0481      4.75 0.00000317   
 4 shape_metric_4     statusmalignant  -0.229     0.0486     -4.72 0.00000361   
 5 vascular_density   statusmalignant   0.108     0.0244      4.43 0.0000133    
 6 stromal_thickness  statusmalignant   0.172     0.0470      3.65 0.000305     
 7 intensity_5        statusmalignant   0.0773    0.0247      3.13 0.00193      
 8 intensity_4        statusmalignant   0.0691    0.0274      2.52 0.0123       
 9 texture_2          statusmalignant  -0.0657    0.0350     -1.88 0.0617       
10 texture_10         statusmalignant  -0.0791    0.0452     -1.75 0.0811       
# ℹ 30 more rows

We can see several features do have significant associations with tissue status! This means that, although those feature associations are not large/strong enough to drive sample separation in PCA, the should be able to be combined to build a good predictive model.

Confirm LM results graphically

#features_of_interest (FOI)
foi <- c('granularity','intensity_8','membrane_curvature')

dat_long_log %>% 
  filter(key %in% foi) %>% 
  ggplot(aes(x=status , y=value)) + 
  geom_boxplot(outlier.alpha = 0) +
  geom_jitter(aes(group=status), width=0.2,size=0.5) +
  facet_wrap(~key)

Save output

For the next steps, we will save the log-transformed, wide-format data (used for the PC analysis).

writexl::write_xlsx(dat_wide_log,
                    'data_processed/tumour_data_log_wide.xlsx')