HA-PACS Project and its Status
Taisuke Boku, Professor, CCS Graduate School of Systems and Information Engineering
The HA-PACS Project is a 3-year project at CCS, University of Tsukuba, to develop a new generation of large-scale GPU cluster for a wide range of computational science, together with research and development of new interconnection network technology for parallel processing with accelerators. This talk provides a project overview and status report.
The application areas covered by the HA-PACS project include particle physics, astrophysics, biophysics, geoscience, and other fields. A base cluster system with 800 TFLOPS of peak performance is dedicated to these applications. Moreover, we are developing a new technology, based on the concept of Tightly Coupled Accelerators (TCA), to reduce the communication latency among GPUs on different nodes. To develop a TCA prototype, we use an FPGA system that enables direct communication between GPUs over the PCI-Express address space without CPU involvement. Through this research, we will develop an elemental technology for fast communication among accelerators across a wide range of applications.
The HA-PACS base cluster began operation in February 2012. Each ultra-high-speed node employs two Intel E5 (Sandy Bridge) CPUs and four NVIDIA Tesla M2090 GPUs to provide 2.99 TFLOPS of peak performance, currently the world's fastest GPU-accelerated node for a massively parallel cluster system. The system has 268 nodes, giving a peak performance of 802 TFLOPS. Various computational science teams are developing large-scale GPU-accelerated codes, and the HA-PACS base cluster will be dedicated to their development. The first TCA prototype is under development, and in early 2013 the HA-PACS base cluster will be extended with a TCA-equipped part providing 200+ TFLOPS of performance.
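The node and system peak figures quoted above can be cross-checked with simple arithmetic. The sketch below is a back-of-envelope check only: the per-device values (665 GFLOPS double precision per GPU, and an 8-core 2.6 GHz CPU issuing 8 double-precision flops per cycle) are assumptions and are not stated in the abstract; only the device and node counts come from the text.

```python
# Back-of-envelope check of the peak figures quoted in the abstract.
# The per-device numbers below are ASSUMPTIONS, not taken from the abstract.
GPU_PEAK_GFLOPS = 665.0   # assumed double-precision peak of one Tesla M2090
CPU_CORES = 8             # assumed core count of one Sandy Bridge E5 CPU
CPU_CLOCK_GHZ = 2.6       # assumed CPU clock
FLOPS_PER_CYCLE = 8       # assumed double-precision flops per core per cycle (AVX)

# These three values are stated in the abstract.
GPUS_PER_NODE = 4
CPUS_PER_NODE = 2
NODES = 268

cpu_peak_gflops = CPU_CORES * CPU_CLOCK_GHZ * FLOPS_PER_CYCLE
node_peak_tflops = (GPUS_PER_NODE * GPU_PEAK_GFLOPS
                    + CPUS_PER_NODE * cpu_peak_gflops) / 1000.0
system_peak_tflops = NODES * node_peak_tflops

print(f"node peak:   {node_peak_tflops:.2f} TFLOPS")
print(f"system peak: {system_peak_tflops:.0f} TFLOPS")
```

Under these assumptions the node peak rounds to 2.99 TFLOPS and the 268-node total to 802 TFLOPS, consistent with the quoted figures.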
Galaxy Collision and the Andromeda Stellar Stream
Masao Mori, Associate Professor, CCS Graduate School of Pure and Applied Sciences
Large spiral galaxies such as the Andromeda galaxy are believed to have formed in part from the merger of many less massive galaxies. We study the interaction between an accreting satellite and the Andromeda galaxy using a high-resolution N-body simulation with forty million particles. For the first time, we show the self-gravitating response of the disk, the bulge, and the dark matter halo of the Andromeda galaxy to an accreting satellite galaxy. Our simulation suggests that the Andromeda stellar stream is the tidal debris formed in the last pericentric passage of a satellite galaxy on a radial orbit.
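For readers unfamiliar with N-body methods, the following is a minimal sketch of the basic idea: softened gravitational forces summed directly over particle pairs and advanced with a kick-drift-kick leapfrog integrator. It is not the method used in the study, which the abstract does not describe, and a forty-million-particle run would need a far more scalable force algorithm. The two-body setup, units (G = 1), softening length, and time step are all illustrative assumptions.

```python
import math

def accelerations(pos, mass, G=1.0, eps=1e-3):
    """Direct-sum gravitational accelerations with softening length eps."""
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = [pos[j][k] - pos[i][k] for k in range(3)]
            r2 = d[0] ** 2 + d[1] ** 2 + d[2] ** 2 + eps ** 2
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))
            for k in range(3):
                acc[i][k] += G * mass[j] * d[k] * inv_r3
                acc[j][k] -= G * mass[i] * d[k] * inv_r3
    return acc

def leapfrog_step(pos, vel, mass, dt):
    """One kick-drift-kick leapfrog step; pos and vel are updated in place."""
    acc = accelerations(pos, mass)
    for i in range(len(pos)):
        for k in range(3):
            vel[i][k] += 0.5 * dt * acc[i][k]   # half kick
            pos[i][k] += dt * vel[i][k]         # full drift
    acc = accelerations(pos, mass)
    for i in range(len(pos)):
        for k in range(3):
            vel[i][k] += 0.5 * dt * acc[i][k]   # second half kick

def total_energy(pos, vel, mass, G=1.0, eps=1e-3):
    """Kinetic plus softened potential energy of the whole system."""
    kinetic = sum(0.5 * m * (v[0] ** 2 + v[1] ** 2 + v[2] ** 2)
                  for m, v in zip(mass, vel))
    potential = 0.0
    for i in range(len(pos)):
        for j in range(i + 1, len(pos)):
            r2 = sum((pos[j][k] - pos[i][k]) ** 2 for k in range(3)) + eps ** 2
            potential -= G * mass[i] * mass[j] / math.sqrt(r2)
    return kinetic + potential

# Hypothetical toy system: a heavy "host" and a light "satellite".
mass = [1.0, 0.01]
pos = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
vel = [[0.0, 0.0, 0.0], [0.0, 0.8, 0.0]]

e_start = total_energy(pos, vel, mass)
for _ in range(1000):
    leapfrog_step(pos, vel, mass, dt=0.001)
e_end = total_energy(pos, vel, mass)
rel_drift = abs((e_end - e_start) / e_start)
print("relative energy drift:", rel_drift)
```

The energy drift is printed as a basic sanity check; leapfrog is a common choice for such sketches because it is time-reversible and keeps the energy error bounded for a fixed step.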
GPU-Accelerated Complex Data Mining
Toshiyuki Amagasa, Associate Professor, CCS Graduate School of Systems and Information Engineering
GPGPU (general-purpose computing on GPUs) has recently become an active research subject in high-performance computing and many other fields. It means using the GPU (Graphics Processing Unit), originally designed for graphics processing, for general-purpose computation. In this talk, we present our recent work on GPU-accelerated data mining over complex data. First, we present a scheme that accelerates Probabilistic Latent Semantic Indexing (PLSI), an automated document indexing method based on a statistical latent semantic model, by exploiting the high parallelism of the GPU. Next, we introduce a method for fast frequent itemset mining from uncertain databases using the GPU. The idea is to accelerate the probability computations by making the best use of the GPU.
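To illustrate what the probability computations in uncertain frequent itemset mining can look like, here is a minimal CPU-only sketch using the common expected-support model, in which each item in a transaction carries an existence probability and items are assumed independent. This is not the speakers' GPU method: the model choice, the toy data, the brute-force candidate enumeration, and the function names are all assumptions made for illustration.

```python
from itertools import combinations
from math import prod

def expected_support(itemset, transactions):
    """Sum, over transactions containing every item, of the product of the
    items' existence probabilities (items assumed independent)."""
    total = 0.0
    for t in transactions:
        if all(item in t for item in itemset):
            total += prod(t[item] for item in itemset)
    return total

def frequent_itemsets(transactions, min_esup, max_size=2):
    """Brute-force enumeration of itemsets whose expected support reaches
    min_esup. Real miners prune candidates; this sketch does not."""
    items = sorted({item for t in transactions for item in t})
    result = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            esup = expected_support(candidate, transactions)
            if esup >= min_esup:
                result[candidate] = esup
    return result

# Hypothetical uncertain database: item -> existence probability.
transactions = [
    {"a": 0.9, "b": 0.8},
    {"a": 0.5, "c": 0.4},
    {"a": 1.0, "b": 0.5, "c": 0.6},
]

frequent = frequent_itemsets(transactions, min_esup=1.1)
for itemset, esup in frequent.items():
    print(itemset, round(esup, 2))
```

Each transaction's contribution is an independent product followed by a sum, which is the kind of data-parallel map-and-reduce pattern that suits a GPU.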
3D Free-Viewpoint Soccer Stadium Project
Itaru Kitahara, Associate Professor, CCS Graduate School of Systems and Information Engineering
This talk introduces our real-time 3D modelling method for soccer players using the billboard technique. A real-time tracking method that stably tracks soccer players by utilising their shadow regions will also be introduced. Since intuitive browsing interfaces are important for enjoying 3D free-viewpoint video, we developed several interfaces to control the virtual camera that captures the 3D video.
HECToR, the AMD Bulldozer architecture and scaling software for exascale
Dr Andy Turner, EPCC
In this presentation I will give a short introduction to the UK National Supercomputing Facility, HECToR (a 90,112 core Cray XE6 machine based on the AMD Bulldozer architecture) and the software used on the facility. First, I will survey the types of scientific software that are most used on HECToR. Then I will give a brief overview of the Bulldozer architecture and its implications for the scientific software used on HECToR. Finally, I will look at the challenges that will be faced in scientific software development with the changes in HPC architecture that are accompanying the push to exaflop resources.
Adrian Jackson, EPCC
Nu-FuSE is an international project (funded through the G8 Research Councils Initiative on Multilateral Research Funding) looking to significantly improve computational modelling capabilities to the level required by the new generation of fusion reactors.
The focus is on three specific scientific areas: fusion plasma; the materials from which fusion reactors are built; and the physics of the plasma edge. This will require computing at the “exascale” level across a range of simulation codes, working together towards fully integrated fusion tokamak modelling.
In this presentation I will outline the aims and goals of the Nu-FuSE project, highlight some of the approaches being taken to tackle exascale computing for fusion simulation, and discuss the different computers we have access to through the project partners.
QM/MM Studies on the reaction mechanisms in metalloenzymes
Shoji Mitsuo, Associate Professor, CCS Graduate School of Pure and Applied Sciences
Metalloenzymes catalyze specific chemical reactions in vivo with high catalytic efficiency and high selectivity. Elucidating their molecular mechanisms is very important not only for the basic understanding of biological systems, but also for applied fields such as chemical synthesis and drug discovery.
In this talk, I will present our recent progress in quantum mechanics/molecular mechanics (QM/MM) studies of reaction mechanisms in metalloenzymes. We deeply appreciate the HECToR resources that were provided to support recovery from the Great East Japan Earthquake and that made these studies possible.
Software Skills for Free-Range Researchers
Neil Chue Hong, Director, UK Software Sustainability Institute
Software is present in every part of the researcher's daily life, from simulations and data analysis to spreadsheets and social media. Yet the way we approach the development of software skills is often constrained to two established models: “full-time student” and “self-learning”. With attitudes to data management, reproducible research and scientific programming all changing, it is important that researchers
This presentation examines the different reasons that researchers may want to acquire new software skills, describes new models which might be better suited to their work schedules, and suggests changes that might be required at an institutional level to support this and nurture research capability.
The presenter is the Director of the UK Software Sustainability Institute (www.software.ac.uk) which is defining a roadmap for supporting scientific software development, and has previously published best practice on software preservation.
Building the next generation of HPC applications — a review of the NAIS and APOS projects
Dr. George Beckett, EPCC
High-performance computing is changing. The drive towards many-core processors, such as Intel’s 80-core MIC chip, and the growing significance of graphics accelerators, such as Nvidia’s CUDA platform, promise a step-change in capability for scientific computing. Researchers have the chance to run complex simulations at an unprecedented scale and speed. However, to realise this potential, scientists need to invest significant effort to align their applications to these new architectures. Looking beyond the hype, major code rewrites are likely, and even wholesale algorithmic changes cannot be ruled out. In this talk, we look at two projects that are tackling these challenges head on: NAIS and APOS.
NAIS (the centre for Numerical Algorithms and Intelligent Software) brings together application developers, computer scientists, and HPC experts, to develop new algorithms (plus re-discover some older ones) that are capable of exploiting massively parallel programming environments.
Complementing this, APOS (Application Performance Optimisation and Scalability) is looking at specific case-study applications from the strategically important areas of fusion energy, molecular modelling, oil recovery, and CFD. Using exemplar applications, the project team are developing new (and better) tools and techniques for adapting existing codes to accelerator-based and many-core platforms.
Scaling an Application to a Thousand GPUs and Beyond
Dr. Alan Gray, EPCC
The GPU architecture is inherently more suitable for many types of intensive parallel computation than the traditional CPU, since it features a large number of lightweight cores offering a relatively high ratio of performance to power consumption. Consequently, an increasing number of massively parallel supercomputers are based on heterogeneous node architectures featuring CPUs coupled with GPUs as compute accelerators. Such systems provide a possible template for future exascale systems (for which power consumption will be a key issue). If an application is to sustain a reasonable fraction of peak performance that scales well across many nodes, it must overcome a number of specific barriers: as well as performance tuning of the individual GPU kernels, all inter-node communication of GPU-resident data involves additional PCIe bus transfers and, potentially, buffer copies in the host CPU memory.
The Ludwig lattice Boltzmann code is written in C with message-passing parallelism using MPI. It has been adapted for GPU utilisation using NVIDIA CUDA (the most common GPU programming model). Significant performance gains were realised on each GPU through restructuring of the data layout to allow memory coalescing and the adaptation of key loops to reduce off-chip memory accesses. The halo-swap communication phase has been designed to efficiently utilise many GPUs in parallel, including the overlapping of several stages using CUDA stream functionality. We present performance results on a prototype Cray XK6 hybrid supercomputer, which has nodes comprising an AMD Interlagos CPU and an NVIDIA Tesla 2090 GPU, coupled using the Cray Gemini interconnect. The GPU adaptation scales excellently all the way to 936 GPUs on the Cray XK6 prototype, the largest run as yet attempted. The overall GPU-accelerated performance is significantly better than that using an equivalent traditional CPU-based Cray XE6 supercomputer (ensuring maximal utilisation of all CPU cores).
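The halo-swap pattern mentioned above can be sketched in a few lines. The example below is a simplified, serial illustration and not Ludwig's code: the MPI ranks are simulated with plain Python lists, the lattice is one-dimensional and periodic, there is no CUDA or stream overlap, and all function names are hypothetical. It assumes the global lattice size is divisible by the number of simulated ranks.

```python
def decompose(global_field, nranks):
    """Split a 1D periodic field into nranks equal subdomains, each padded
    with one halo cell on either side (assumes even divisibility)."""
    n_local = len(global_field) // nranks
    return [[0.0] + global_field[r * n_local:(r + 1) * n_local] + [0.0]
            for r in range(nranks)]

def halo_swap(subdomains):
    """Fill each subdomain's halo cells from its periodic neighbours' edge
    interior cells. In an MPI code this is where messages are exchanged."""
    nranks = len(subdomains)
    for r, local in enumerate(subdomains):
        left = subdomains[(r - 1) % nranks]
        right = subdomains[(r + 1) % nranks]
        local[0] = left[-2]    # left halo <- last interior cell of left neighbour
        local[-1] = right[1]   # right halo <- first interior cell of right neighbour

def stencil_step(subdomains):
    """Three-point average over interior cells; edge cells read the halos."""
    return [[0.0]
            + [(loc[i - 1] + loc[i] + loc[i + 1]) / 3.0
               for i in range(1, len(loc) - 1)]
            + [0.0]
            for loc in subdomains]

def gather(subdomains):
    """Concatenate the interior cells of all subdomains."""
    return [x for loc in subdomains for x in loc[1:-1]]

# Hypothetical 12-cell lattice split across 3 simulated ranks.
field = [float(i) for i in range(12)]
subdomains = decompose(field, nranks=3)
halo_swap(subdomains)
decomposed_result = gather(stencil_step(subdomains))

# Serial reference with periodic wrap-around, for comparison.
n = len(field)
serial_result = [(field[(i - 1) % n] + field[i] + field[(i + 1) % n]) / 3.0
                 for i in range(n)]
print(decomposed_result == serial_result)
```

On a GPU cluster, each halo cell copied here corresponds to data that must leave the GPU, cross the PCIe bus, and travel over the interconnect, which is why overlapping these stages with computation matters for scaling.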