Institute for Advanced Simulation (IAS)

Training course "From zero to hero: Understanding and fixing intra-node performance bottlenecks"

(Course no. 1242018 in the 2018 training programme of Forschungszentrum Jülich)

Start: 11 April 2018, 09:00
End: 12 April 2018, 16:30
Venue: Jülich Supercomputing Centre, Ausbildungsraum 1 (training room 1), building 16.3, room 213a

The course is fully booked. Further interested persons will be placed on the waiting list.

Target audience: Scientists and software developers who want to understand the performance-critical hardware aspects of modern CPUs (advanced course)

Modern HPC hardware has many advanced and not easily accessible features that contribute significantly to the overall intra-node performance. However, many compute-bound HPC applications have grown historically to simply use more cores and were not designed to exploit these features.

To make things worse, modern compilers cannot generate fully vectorized code automatically unless the data structures and dependencies are very simple. As a consequence, such applications use only a small fraction of the available peak performance. As scientists, we therefore have the added responsibility to design generic data layouts and data access patterns that give the compiler a fighting chance to generate code using most of the available hardware features. Such data layouts and access patterns are vital for exploiting the performance offered by vectorization/SIMDization. Generic algorithms like FFTs or basic linear algebra can be accelerated by using third-party libraries and tools that are specifically tuned and optimized for a multitude of different hardware configurations.

But what happens if your problem does not fall into this category and third-party libraries are not available? This training course will shed some light on how the goal of utilizing on-core performance and, ultimately, performance portability can be achieved.

In the first part of the training course, we want to give insights into today's CPU microarchitecture and apply this knowledge in the hands-on session. As a demonstrator, we will use a simple Coulomb solver and improve the code step by step. We will start from a basic implementation and advance to an optimized version that uses hardware features like vectorization to increase performance.
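
To make the starting point concrete, a basic direct-summation Coulomb kernel might look like the following minimal sketch. It is not the course's actual code: the `Particles` type, the function name `coulomb_potential`, and the choice of units (prefactor set to 1) are assumptions made purely for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical particle container (illustration only, not from the course).
struct Particles {
    std::vector<double> x, y, z;  // positions
    std::vector<double> q;        // charges
};

// Direct O(N^2) summation of the Coulomb potential at each particle:
//   phi_i = sum over j != i of q_j / |r_i - r_j|
// Assumption: units chosen so that the prefactor 1/(4*pi*eps0) equals 1.
std::vector<double> coulomb_potential(const Particles& p) {
    const std::size_t n = p.x.size();
    std::vector<double> phi(n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        double sum = 0.0;
        for (std::size_t j = 0; j < n; ++j) {
            if (j == i) continue;  // skip the self-interaction
            const double dx = p.x[i] - p.x[j];
            const double dy = p.y[i] - p.y[j];
            const double dz = p.z[i] - p.z[j];
            sum += p.q[j] / std::sqrt(dx * dx + dy * dy + dz * dz);
        }
        phi[i] = sum;
    }
    return phi;
}
```

In a kernel of this shape, the branch in the inner loop and the division and square root per pair are the kind of details that can limit how well a compiler vectorizes the code, which is why such a kernel is a natural candidate for step-by-step optimization.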

The exercises will also contain training on the use of open-source tools to measure and understand the achieved performance. Such optimizations, however, depend heavily on the targeted hardware and should not be part of the algorithmic layer of the code.

In the second part, we will present a detailed description of possible abstraction layers that hide such hardware specifics and thereby preserve readability and maintainability. We will also discuss the overhead costs of the introduced abstraction and show compile-time SIMD configurations and corresponding performance results on different platforms.
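
One possible shape of such an abstraction layer is sketched below. This is an assumption-laden illustration, not the design presented in the course: the `Pack` template, the `axpy` kernel, and the `SIMD_WIDTH` macro are hypothetical names. The lanes are processed with a plain loop here; a real implementation would typically specialize the operators with platform intrinsics.

```cpp
#include <array>
#include <cstddef>

// Hypothetical fixed-width pack of W values of type T (illustration only).
// The per-lane loops stand in for hardware-specific SIMD instructions.
template <typename T, std::size_t W>
struct Pack {
    std::array<T, W> lane{};

    friend Pack operator+(const Pack& a, const Pack& b) {
        Pack r;
        for (std::size_t k = 0; k < W; ++k) r.lane[k] = a.lane[k] + b.lane[k];
        return r;
    }
    friend Pack operator*(const Pack& a, const Pack& b) {
        Pack r;
        for (std::size_t k = 0; k < W; ++k) r.lane[k] = a.lane[k] * b.lane[k];
        return r;
    }
};

// Algorithmic layer: written once, independent of the vector width.
// Works for plain scalars as well as for Pack types.
template <typename V>
V axpy(const V& a, const V& x, const V& y) {
    return a * x + y;
}

// Compile-time SIMD configuration (hypothetical macro name):
// override per platform, e.g. with -DSIMD_WIDTH=8.
#ifndef SIMD_WIDTH
#define SIMD_WIDTH 4
#endif
constexpr std::size_t simd_width = SIMD_WIDTH;
using VecD = Pack<double, simd_width>;
```

The point of the sketch is that `axpy` contains no hardware-specific code: switching the width, or replacing the loops inside `Pack` with intrinsics, leaves the algorithmic layer untouched.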

Topics covered include:

  • Inside a CPU: A scientist's view on modern CPU microarchitecture
  • Data structures: When to use SoA, AoS and AoSoA
  • Vectorization: SIMD on JURECA and JURECA Booster
  • Unrolling: Loop-unrolling for out-of-order execution and instruction-level parallelism
  • Data Reuse: Register file and cache-blocking
  • Compiler: When and how to use compiler optimization flags
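
As a rough illustration of the data-structure topic, the following sketch contrasts the three layouts for a set of charged particles. All type and function names are hypothetical and chosen for this example only; they are not taken from the course material.

```cpp
#include <cstddef>
#include <vector>

// AoS (array of structures): one struct per particle.
// Consecutive x values are sizeof(ParticleAoS) bytes apart in memory.
struct ParticleAoS {
    double x, y, z, q;
};

// SoA (structure of arrays): one contiguous array per field.
// Consecutive x values are adjacent, which suits SIMD loads.
struct ParticlesSoA {
    std::vector<double> x, y, z, q;
};

// AoSoA (array of structures of arrays): small SoA blocks of W particles,
// where W would typically match the SIMD width of the target platform.
template <std::size_t W>
struct ParticleBlock {
    double x[W], y[W], z[W], q[W];
};

// Summing one field touches every struct in the AoS layout (strided access).
double sum_x_aos(const std::vector<ParticleAoS>& p) {
    double s = 0.0;
    for (const auto& e : p) s += e.x;
    return s;
}

// The same reduction in the SoA layout reads one contiguous array.
double sum_x_soa(const ParticlesSoA& p) {
    double s = 0.0;
    for (double v : p.x) s += v;
    return s;
}
```

Both functions compute the same result; they differ only in memory access pattern, which is what tends to decide whether a loop can be vectorized efficiently.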

If you have ever asked yourself one of the following questions, this course is for you.

  • What is the performance of my code and how fast could it actually be?
  • Why is my performance so bad?
  • Does my code use SIMD?
  • Why does my code not use SIMD and why does the compiler not help me?
  • Is my data structure optimal for this architecture?
  • Do I need to redo everything for the next machine?
  • Why is this so complicated? I thought the science was the hard part.

The course consists of lectures and hands-on sessions. After each topic is presented, the participants can apply the knowledge right away in the hands-on training. The C++ code examples are generic and advance step by step. Even if you do not speak C++, it will be possible to follow along and understand the underlying concepts.

Prerequisites: Linux (ssh), command-line tools (grep, less), knowledge of Fortran, C, C++
Git: examples will be provided in a git repository
Editors: vim or emacs for easier/faster handling of performance data
Language: The course will be held in English.
Duration: 2 days
Date: 11-12 April 2018, 09:00-16:30
Venue: Jülich Supercomputing Centre, Ausbildungsraum 1 (training room 1), building 16.3, room 213a
Number of participants: minimum 5, maximum 14
Instructors: Andreas Beckmann, Dr. Ivo Kabadshow, JSC
Contact: Andreas Beckmann
Phone: +49 2461 61-8713

Please direct enquiries to Andreas Beckmann.

If you are not an employee of Forschungszentrum Jülich, please provide the following information when registering:
first name, last name, date of birth, nationality, full residential address, e-mail address