Posted: 23 Mar 2020 | 10:45
I was recently working with a colleague to investigate performance issues on a login node for one of our HPC systems. I should say upfront that looking at performance on a login node is generally not advisable: login nodes are shared resources and are not optimised for performance.
We always tell our students not to run performance benchmarking on login nodes, because it's hard to ensure the results are reproducible. However, in this case we were just running a very small (serial) test program on the login node to ensure it worked before submitting it to the batch system, and my colleague noticed an unusual performance variation across login nodes.
Posted: 9 Oct 2019 | 17:30
Sharing of resources has challenges for the performance and scaling of large parallel applications. In the NEXTGenIO project we have been focusing specifically on I/O and data management/storage costs, working from the realisation that current filesystems will struggle to efficiently load and store data from millions of processes or tasks all requesting different data sets or bits of information.
Posted: 27 Feb 2019 | 15:53
The MPI Standard states that nonblocking communication operations can be used to “improve performance… by overlapping communication with computation”. This is an important performance optimisation in many parallel programs, especially when scaling up to large systems with lots of inter-process communication.
However, nonblocking operations can also help with making a code correct – without introducing additional dependencies that can degrade performance.
Posted: 12 Dec 2017 | 11:16
November 2017 Top500
My initial impression of the latest Top500 list, released last month at the SC17 conference in Denver, was that little has changed. This might not be the conclusion that many will have reached, and indeed we will come on to consider some big changes (or perceived big changes) that have been widely discussed, but looking at the Top 10 entries there has been little movement since the previous list (released in June).
Posted: 24 May 2017 | 19:30
When we parallelise and optimise computational simulation codes we always have choices to make: which type of parallel model to use (distributed memory, shared memory, PGAS, single-sided, etc), whether the algorithm used needs to be changed, and what parallel functionality to use (loop parallelisation, blocking or non-blocking communications, collective or point-to-point messages, etc).
Posted: 11 May 2017 | 00:06
As part of the ARCHER Knights Landing (KNL) processor testbed, we have produced and collected a set of benchmark reports on the performance of various scientific applications on the system. This has involved the ARCHER CSE team, EPCC's Intel Parallel Computing Center (IPCC) team, and various users of the system all benchmarking and documenting the performance they have experienced.
Posted: 11 Apr 2017 | 17:59
Shall I compare thee...
Performance comparisons are always tricky to get exactly right. They are needed to ensure that we can demonstrate the performance improvements that optimisations, new hardware, new algorithms, etc... have had on an application or benchmark, but there is a lot of latitude in what can be compared, which makes it easy to get a performance comparison wrong and not properly demonstrate whatever it is you're trying to show.
Posted: 10 Mar 2017 | 15:39
Measuring performance is a key part of any code optimisation or parallelisation process. Without knowing the baseline performance, and what has been achieved after the work, it's impossible to judge how successful any intervention has been. However, it's something that we, as a community, get wrong all the time, at least when we present our results in papers, presentations, blog posts, etc. I'm not suggesting that people aren't measuring performance correctly, or are deliberately falsifying performance improvements, but the incentives to make your work look as impressive as possible cause people to present results in a way that really isn't justified.
Posted: 29 Jul 2016 | 16:45
Initial experiences on early KNL
Updated 1st August 2016 to add a sentence describing the MPI configurations of the benchmarks run.
Updated 30th August 2016 to add CASTEP performance numbers on Broadwell with some discussion.
KNL is a many-core processor, successor to the KNC, that has up to 72 cores, each of which can run 4 threads, and 16 GB of high bandwidth memory stacked directly on to the chip.
Posted: 21 Jun 2016 | 17:13
There's been a lot of discussion about the latest Top500 list, released this week at ISC16. Most of the interest has been in the whopping new Chinese system, Sunway TaihuLight, which has come in at number 1 on the list with a massive 93 PFlop/s rmax Linpack performance, and 125 PFlop/s rpeak theoretical peak performance (3 times bigger than the previous number 1 system).
Whilst this is a very interesting system, and much bigger than is currently planned elsewhere, it's not unknown for very large systems to come in and dominate the list like this. Back in 2002, the Japanese Earth Simulator system became the number 1 machine with an rmax of ~5x that of the previous number 1 system, and it stayed as the top machine for a number of years.