New MSc course: "Machine Learning at Scale"

28 March 2024

As part of the continual evolution of EPCC's educational programmes, we have introduced a new course to our MSc teaching. "Machine Learning at Scale" is designed to bridge the gap between basic machine learning development and use, and the exploitation of large scale high performance computing systems and hardware.

An image generated with DALL.E using ChatGPT that purports to show machine learning at scale. It is an image of a large colourful computer surrounded by a range of people working on desktops or standing with clipboards.

Image above generated using DALL.E through ChatGPT.

Machine Learning (ML) models have grown in scale and computational complexity, providing greatly improved functionality and quality, but requiring very large amounts of resources. A large language model (LLM) can routinely use thousands of GPUs during a training session, with the model itself requiring tens of GPUs simply to fit within the available GPU memory.
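To give a rough feel for why a model may not fit on one GPU, here is a back-of-envelope sketch. All of the numbers are illustrative assumptions rather than figures from any particular system: a hypothetical 70 billion parameter model, 80 GB of memory per GPU, and roughly 16 bytes of training state per parameter (a commonly quoted estimate for mixed-precision training with the Adam optimiser). Activations and communication buffers are ignored, so the result is a lower bound.

```python
def min_gpus_for_training(n_params, bytes_per_param=16, gpu_memory_gb=80):
    """Lower bound on the GPUs needed just to hold a model's training state.

    bytes_per_param=16 is an assumed estimate for mixed-precision Adam:
    2 (fp16 weights) + 2 (fp16 gradients) + 12 (fp32 weights, momentum, variance).
    Activation memory is deliberately ignored.
    """
    total_bytes = n_params * bytes_per_param
    gpu_bytes = gpu_memory_gb * 10**9
    # Integer ceiling division avoids any floating point rounding.
    return -(-total_bytes // gpu_bytes)


# Hypothetical 70 billion parameter model on hypothetical 80 GB GPUs.
n_gpus = min_gpus_for_training(70 * 10**9)
print(f"At least {n_gpus} GPUs needed to hold the training state")
```

Under these assumptions the training state alone is about 1.1 TB, so the model has to be split across more than a dozen devices before any training data is processed.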

Developing and running ML models, such as deep neural networks (DNNs) or convolutional neural networks (CNNs), on a local computer, using a single GPU or accelerator device, can be accomplished with relative ease given the functionality that ML frameworks such as PyTorch, TensorFlow, and JAX now provide. These frameworks also have significant capabilities to optimise single GPU ML applications, through just-in-time compilation, use of optimised maths and tensor libraries, and other enhanced features.

However, as the quality and performance of ML models tend to depend on the size and scale of the neural networks used, their size and computational complexity have rapidly increased over recent years. This pushes up the demand for memory and computational capacity, meaning that either models take large amounts of time to train on a single GPU, or the models themselves no longer fit on a single GPU.

The skills required to parallelise ML models, exploit large numbers of GPUs, and optimise data storage, I/O, and communication networks for large scale ML models have many similarities to the skills required for large scale computational simulation, activities that we at EPCC have been heavily involved with for the past 30 years. 

This has placed us in a good position to marry up the knowledge we have on efficiently exploiting large scale computers with the emerging field of large scale ML models, to provide students with the skills, knowledge, and approaches required to support, develop, and exploit machine learning now and in the future.

In the Machine Learning at Scale course we teach the basics of neural networks, how they utilise common HPC hardware, the methods used to parallelise ML models, and the range of optimised hardware that is available for ML operations, both in training networks and in deploying them at scale (inference). We provide a range of hands-on experience in developing, optimising, and deploying ML approaches, using the zoo of HPC systems available to EPCC.
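As a flavour of what parallelising an ML model can involve, the sketch below simulates data parallelism, one widely used approach, in plain Python. It is an illustrative toy rather than course material: the one-parameter model y = w * x, the function names, and the "workers" (ordinary list slices standing in for GPUs) are all assumptions made for the example. Each worker computes a gradient on its own shard of the batch, and the gradients are then averaged, which is the role an all-reduce operation plays on a real system.

```python
def gradient(w, batch):
    """Gradient of the mean squared error of y = w * x over a batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def data_parallel_gradient(w, batch, n_workers):
    """Shard the batch, compute one gradient per worker, then average them."""
    if len(batch) % n_workers != 0:
        raise ValueError("batch size must be divisible by n_workers")
    shard_size = len(batch) // n_workers
    shards = [batch[i * shard_size:(i + 1) * shard_size] for i in range(n_workers)]
    # Each entry corresponds to the work one simulated GPU would do.
    local_grads = [gradient(w, shard) for shard in shards]
    # Averaging stands in for an all-reduce (mean) across devices.
    return sum(local_grads) / n_workers


data = [(float(x), 3.0 * x) for x in range(1, 9)]  # 8 samples of y = 3x
w = 0.0
full = gradient(w, data)
parallel = data_parallel_gradient(w, data, n_workers=4)
print(full, parallel)
```

With equally sized shards the averaged gradient matches the single-device gradient over the whole batch, which is why this approach can spread the work across devices without changing the result of a training step.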

Find out more

If you're interested in these types of knowledge and skills, consider signing up for one of our MSc programmes. Or contact me to discover whether we can provide training or training materials for you or your organisation.


Mr Adrian Jackson