Graduate and Postdoctoral Studies
Karthik Srinivasa Murthy
Code Generation for Extreme Scale Parallel Systems
Friday, March 3, 2017
to 12:00 PM
227 Abercrombie Engineering Laboratory
Power consumption and fabrication limitations are increasingly playing
significant roles in the design of extreme scale parallel systems.
These factors are influencing system designers to support higher on-node
computing capability via throughput-optimized processors instead of
latency-optimized processors. However, the inter- and intra-processor
communication capabilities on such systems are not increasing at the same
rate as the on-node computing capability.
Consequently, achieving high performance requires careful orchestration
of both single- and multiprocessor parallelism. This thesis shows that
compiler technology and expressive programming model constructs can help
applications more effectively exploit both forms of parallelism.
Compilers play an important role in harnessing short vector parallelism
supported by cores in modern processors. Over last ten years, vector
widths have increased dramatically from the 64-bit vectors supported by
Intel's Pentium MMX processor to the 512-bit vectors supported by Intel's
Knights Corner processor. However, the vectorization capabilities of
state-of-the-art compilers are still immature, failing in the presence of
complex control flow and data dependencies. This thesis presents compiler
transformations that enable efficient vector parallelism in the presence
of common kinds of complex dependencies.
To enable efficient multiprocessor parallelism, this thesis develops
compiler technology to support sophisticated algorithms that minimize
interprocessor communication. The class of .5D communication-avoiding
algorithms was developed to address the inter-processor communication
bottleneck. Mapping these algorithms to complex architectures efficiently
is tedious for even expert programmers. To address this issue, this
thesis presents the Maunam compiler, which generates efficient parallel
code from a high-level, global-view sketch of a .5D algorithm that is
expressed using symbolic data sizes and numbers of processors.
To mitigate the cost of communication for multiprocessor parallelism,
this thesis develops a novel compiler transformation to overlap
communication with computation for systolic computations.
Additionally, to aid effective management of the completion of
non-blocking communication, this thesis presents two synchronization
constructs, cofence and distributed phasers.