8 Machine learning
Background
‘Machine learning’ (ML) is an umbrella term that encompasses classical statistical methods, linear models (both simple and generalized), and more flexible, often non-parametric techniques such as decision trees, support vector machines, autoencoders, and neural networks. Each of these methods is called a machine learning ‘engine’ when used to train a model.
ML forms the foundation of modern artificial intelligence. For instance, large language models such as ChatGPT are trained using deep neural networks.
Machine learning refers to the use of computers (machines) to identify statistical trends in data and store those trends/relationships in a ‘model’. The process of identifying trends (‘learning’) and then predicting features of newly provided data (‘predicting’) is common to all ML workflows.
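The ‘learn, then predict’ pattern can be illustrated with a minimal sketch in base R. It assumes the built-in cars data set as a stand-in for real study data and a simple linear model as the engine; neither is part of this workshop's analysis.

```r
# Toy data: the built-in 'cars' data set (speed vs. stopping distance),
# used here only as a stand-in for real study data.
data(cars)

# 'Learning': identify the trend between speed and distance
# and store it in a model object.
model <- lm(dist ~ speed, data = cars)

# 'Predicting': apply the stored trend to newly provided data.
new_data <- data.frame(speed = c(10, 20))
predictions <- predict(model, newdata = new_data)
print(predictions)
```

The same two steps, fitting a model and then calling predict() on new data, recur in every workflow later in this chapter, whatever the engine.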
Using ML to predict the status (label) of new samples is termed ‘supervised ML’. Here the model is trained on data for which the labels are already known.
In ‘unsupervised’ ML, the model is trained on data without labels. The model tries to find patterns, groupings, or structures in the data, often by performing pair-wise sample similarity tests. This is useful when we don’t know the categories in advance, such as clustering patients based on gene expression profiles, or identifying hidden patterns in imaging data.
Comparing ‘data-driven’ groupings to existing category labels can either confirm existing categories or provide evidence for revising them. Principal components analysis, multidimensional scaling and tree-based (hierarchical) clustering algorithms are examples of unsupervised ML.
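As a rough illustration of unsupervised ML, the sketch below runs principal components analysis and hierarchical clustering in base R, then compares the data-driven groupings to existing labels. The built-in iris data set and the choice of three clusters are assumptions made for the example only.

```r
# Toy data: the built-in 'iris' data set, a stand-in for real study data.
data(iris)

# Remove the label column and standardize the features,
# so the methods below never see the species labels.
features <- scale(iris[, 1:4])

# Principal components analysis: summarize the features in a few axes.
pca <- prcomp(features)

# Tree-based (hierarchical) clustering built on pair-wise sample distances.
distances <- dist(features)
tree <- hclust(distances)
clusters <- cutree(tree, k = 3)  # k = 3 is an assumption for this example

# Compare the data-driven groupings to the existing category labels.
comparison <- table(cluster = clusters, species = iris$Species)
print(comparison)
```

If most samples of one species fall into a single cluster, the data-driven grouping supports the existing label; species spread across several clusters would suggest the categories deserve a closer look.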
Biomedical Context
ML has many exciting applications in biomedical research. In this workshop we will run a supervised ML workflow on data including histological (imaging) features of tissue slices, with matched labels indicating whether the tissue is healthy/benign or malignant (cancerous).
Terminology
When training ML models, the machine ‘learns from the training data set’ and ‘predicts the validation data set’. The accuracy of the predictions on the validation data is used to adjust and compare candidate models, for example when choosing hyperparameters. (This should not be confused with ‘reinforcement learning’, a separate branch of ML in which an agent learns from rewards.)
Once the model has been trained, it is evaluated on a ‘hold-out’ or ‘test’ data set, which has not been used at any point in training or validation. Performance on the test set is the ultimate measure of the ‘predictive value’ of a model, that is, how valuable it is for informing real-world decisions.
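The three data sets described above could be created along the following lines. This is a sketch using initial_split() from the rsample package (part of tidymodels); the simulated data frame toy, its column names and the 60/20/20 proportions are all assumptions for illustration.

```r
library(rsample)

set.seed(123)

# Simulated stand-in data (hypothetical features and labels).
n <- 200
toy <- data.frame(
  feature_1 = rnorm(n),
  feature_2 = rnorm(n),
  status = factor(sample(c("benign", "malignant"), n, replace = TRUE))
)

# Step 1: set aside roughly 20% of samples as the hold-out (test) set.
# 'strata' keeps the benign/malignant balance similar in each part.
split_1 <- initial_split(toy, prop = 0.8, strata = status)
test_set <- testing(split_1)
not_test <- training(split_1)

# Step 2: divide the remaining samples into training (75%) and
# validation (25%) sets, giving roughly 60/20/20 overall.
split_2 <- initial_split(not_test, prop = 0.75, strata = status)
train_set <- training(split_2)
validation_set <- testing(split_2)

c(train = nrow(train_set),
  validation = nrow(validation_set),
  test = nrow(test_set))
```

The test set is split off first so that nothing done during training or validation can leak information into it.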
Aim
Train a supervised machine learning model (classifier) to predict tumour status based on histological features.
Methods
This work will proceed in 3 parts:
1. exploring the data and trialing simple statistical analyses
2. training a machine learning model using pre-defined hyperparameters
3. optimizing the model using a ‘hyperparameter tuning’ process
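The three parts listed above might look roughly like the sketch below. It is not the workshop's actual analysis: the simulated tibble toy, the feature names radius and texture, the decision tree engine and the grid values are all assumptions chosen to keep the example small.

```r
library(tidymodels)

set.seed(42)

# Simulated stand-in data (hypothetical feature names and values).
n <- 300
status <- factor(sample(c("benign", "malignant"), n, replace = TRUE))
toy <- tibble(
  status = status,
  radius = rnorm(n, mean = ifelse(status == "malignant", 17, 12), sd = 3),
  texture = rnorm(n, mean = ifelse(status == "malignant", 21, 18), sd = 4)
)

data_split <- initial_split(toy, prop = 0.8, strata = status)
train_set <- training(data_split)

# Part 1: explore the data and trial a simple statistical analysis.
print(summary(train_set))
print(t.test(radius ~ status, data = train_set))

# Part 2: train a model using pre-defined hyperparameters.
tree_spec <- decision_tree(tree_depth = 3, min_n = 10) %>%
  set_engine("rpart") %>%
  set_mode("classification")

tree_fit <- workflow() %>%
  add_formula(status ~ radius + texture) %>%
  add_model(tree_spec) %>%
  fit(data = train_set)

# Part 3: hyperparameter tuning. tune() marks values to be searched.
tune_spec <- decision_tree(tree_depth = tune(), min_n = tune()) %>%
  set_engine("rpart") %>%
  set_mode("classification")

tune_wf <- workflow() %>%
  add_formula(status ~ radius + texture) %>%
  add_model(tune_spec)

# Cross-validation folds play the role of the validation data.
folds <- vfold_cv(train_set, v = 5, strata = status)
grid <- crossing(tree_depth = c(1L, 3L, 5L), min_n = c(5L, 20L))

tuned <- tune_grid(tune_wf, resamples = folds, grid = grid,
                   metrics = metric_set(accuracy))
best <- select_best(tuned, metric = "accuracy")

# Refit the best candidate on the training set, then score the test set once.
final_fit <- last_fit(finalize_workflow(tune_wf, best), data_split)
final_metrics <- collect_metrics(final_fit)
print(final_metrics)
```

Note that the test set is touched only in the final last_fit() call; all tuning decisions are made using resamples of the training set.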
Tidymodels package
tidymodels is an R meta-package that bundles multiple sub-packages (including rsample, recipes, parsnip, workflows, tune and yardstick), all of which work together with tidy data.
In the past, each ML engine was siloed in its own package, with unique functions and arguments. tidymodels provides a consistent user experience for data pre-processing, training, validation, model tuning and prediction, and allows us to switch between ML engines quickly and easily!
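To give a sense of what switching engines looks like, the sketch below fits two different model types through the same fit() and predict() calls. The simulated data, the single feature radius and the choice of the glm and rpart engines are assumptions for illustration.

```r
library(tidymodels)

set.seed(1)

# Simulated stand-in data (hypothetical feature name and values).
n <- 200
status <- factor(sample(c("benign", "malignant"), n, replace = TRUE))
toy <- tibble(
  status = status,
  radius = rnorm(n, mean = ifelse(status == "malignant", 17, 12), sd = 3)
)

# Two model specifications; only the model type and engine differ.
specs <- list(
  logistic = logistic_reg() %>%
    set_engine("glm") %>%
    set_mode("classification"),
  tree = decision_tree() %>%
    set_engine("rpart") %>%
    set_mode("classification")
)

# The same fit() and predict() calls work for either engine.
fits <- lapply(specs, function(spec) fit(spec, status ~ radius, data = toy))
preds <- lapply(fits, function(f) predict(f, new_data = toy))

# Proportion of training samples classified correctly by each model.
# (Accuracy on training data is optimistic; it is shown only to
# confirm that both engines return predictions in the same format.)
train_accuracy <- sapply(preds, function(p) mean(p$.pred_class == toy$status))
print(train_accuracy)
```

Because both engines return a tibble with a .pred_class column, downstream code for evaluation does not need to change when the engine does.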
To learn more about machine learning in R, check out the excellent book Tidy Modeling with R by Max Kuhn and Julia Silge.