IMPACT 2020
10th International Workshop on Polyhedral Compilation Techniques
January 22, 2020 | Bologna, Italy
In conjunction with HiPEAC 2020, January 20-22, 2020
With multicore processors and deep memory hierarchies remaining the main source
of performance, and emerging areas such as energy-aware computation requiring
research breakthroughs, polyhedral compilation techniques receive more
attention than ever from both researchers and practitioners. Thanks to a
unified formalism for parallelization and memory optimization, these techniques
are now powering domain-specific language compilers yielding best performance
in highly competitive areas such as machine learning and numeric simulation.
IMPACT is a unique event focusing exclusively on polyhedral compilation that
brings together researchers and practitioners for a high-quality one-day event
including a keynote, selected paper presentations, and posters for
work-in-progress discussions.
Program
Opening and Keynote (10:00 — 11:00)
- Welcome: Guillaume Iooss (ENS Paris) and Philippe Clauss (University of Strasbourg)
[Slides]
- Keynote: High-Performance Compilation Opportunities in MLIR
Uday Bondhugula (Indian Institute of Science)
[Abstract]
This talk is primarily about the high-performance compilation opportunities provided by the MLIR project. The MLIR project was initiated to deliver the next generation of optimizing compiler infrastructure, with a focus on serving the computational demands of AI and machine learning programming models. It was announced and open-sourced by Google in April 2019. MLIR is a new intermediate representation designed to provide a unified, modular, and extensible infrastructure to progressively lower dataflow compute graphs, potentially through loop nests, to high-performance target-specific code. MLIR shares similarities with traditional CFG-based three-address SSA representations (including LLVM IR and the Swift intermediate language), but it also introduces notions from the polyhedral compiler framework as first-class concepts to allow powerful analysis and transformation in the presence of loop nests and multi-dimensional arrays. MLIR supports multiple front- and back-ends and uses LLVM IR as one of its primary code generation targets. It is thus a very useful infrastructure for developing new compilers (both general-purpose and domain-specific), especially for solving the compilation challenges involved in targeting emerging AI and machine learning programming languages/models to the plethora of specialized accelerator chips. While MLIR has larger, ambitious goals, this talk will focus in particular on the exciting opportunities MLIR provides to bring polyhedral techniques to bear in a principled way in a production setting and have a lasting impact.
[Bio]
Uday Bondhugula is an associate professor of computer science at the
Indian Institute of Science (IISc), Bangalore, and the founder of
a startup, PolyMage Labs. His research interests lie in the areas of
high-performance computing, compilers, polyhedral framework, automatic
parallelization, and in high-performance systems/accelerators for deep
learning and artificial intelligence. He is the original author and
maintainer of Pluto, a source-to-source loop parallelization and
optimization tool based on the polyhedral framework. He recently
received the ACM SIGPLAN Most Influential Paper award for his PLDI 2008
paper on polyhedral optimization for parallelism and locality. As a
visiting researcher at the Google Brain team in 2018, he was a founding
team member of the MLIR project. He received his Ph.D. from the Ohio
State University in 2008, and his Bachelor's in Computer Science
from the Indian Institute of Technology, Madras in 2004.
[Slides]
Break (11:00 — 11:30)
Session 1: Polyhedral techniques for new architectures (11:30 — 13:00)
- Paper: TC-CIM: Empowering Tensor Comprehensions for Computing In-Memory
Andi Drebes, Lorenzo Chelini, Oleksandr Zinenko, Albert Cohen, Henk Corporaal, Tobias Grosser, Kanishkan Vadivel and Nicolas Vasilache
[Abstract]
Memristor-based, non-von Neumann architectures performing tensor operations directly in memory are a promising approach to address the ever-increasing demand for energy-efficient, high-throughput hardware accelerators for Machine Learning (ML) inference. A major challenge for the programmability and exploitation of such Computing-In-Memory (CIM) architectures consists in the efficient mapping of tensor operations from high-level ML frameworks to fixed-function hardware blocks implementing in-memory computations.
We demonstrate the programmability of memristor-based accelerators with TC-CIM, a fully-automatic, end-to-end compilation flow from Tensor Comprehensions, a mathematical notation for tensor operations, to fixed-function memristor-based hardware blocks. Operations suitable for acceleration are identified using Loop Tactics, a declarative framework to describe computational patterns in a polyhedral representation. We evaluate our compilation flow on a system-level simulator based on gem5, incorporating crossbar arrays of memristive devices. Our results show that TC-CIM reliably recognizes tensor operations commonly used in ML workloads across multiple benchmarks in order to offload these operations to the accelerator.
[Paper]
[Slides]
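To make the target of such pattern matching concrete, here is a minimal sketch (ours, not from the paper; plain C) of the kind of tensor operation a declarative pattern framework like Loop Tactics could recognize in the polyhedral representation and offload to a fixed-function in-memory block:

/* Illustrative only: a plain C matrix multiplication, a tensor
   contraction of the kind TC-CIM aims to recognize and offload. */
void matmul(int n, float A[n][n], float B[n][n], float C[n][n]) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            C[i][j] = 0.0f;
            for (int k = 0; k < n; k++)   /* reduction dimension */
                C[i][j] += A[i][k] * B[k][j];
        }
}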
- Paper: Generating SIMD Instructions for Cerebras CS-1 using Polyhedral Compilation Techniques
Sven Verdoolaege, Manjunath Kudlur, Rob Schreiber and Harinath Kamepalli
[Abstract]
The Cerebras CS-1 is a computing system based on a wafer-scale processor having nearly 400,000 compute cores. It is intended for training of and inference on deep neural networks. The architecture has several features specifically designed for this and related fields. One of these is a sophisticated SIMD engine that can mimic a rectangular loop nest of depth at most four. In order to achieve optimal performance, it is crucial to use SIMD instructions as much as possible.
This paper describes a high-level polyhedral compiler that takes a high-level algorithm description, written manually or extracted from a TensorFlow computation graph, and generates input to the low-level C-based compiler. In this intermediate code, the use of SIMD instructions is made explicit. The main focus of the paper is the generation of these CS-1 SIMD instructions for convolution-style algorithms. What complicates the task is that the set of computation instances that need to be performed may not, at first sight, look like a rectangular loop nest. The basis of the compilation is an effective combination of relatively well-known but more specialized polyhedral operations.
[Paper]
[Slides]
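As a concrete (and deliberately simple) illustration, not taken from the paper: a 1-D convolution whose two perfectly nested loops form an n-by-w rectangle, i.e. a rectangular loop nest of depth two that a SIMD engine mimicking nests of depth at most four could execute directly; the paper's harder cases are computations where this rectangular structure is not apparent in the original formulation.

/* Illustrative only: the (i, k) iteration space is an n-by-w rectangle. */
void conv1d(int n, int w, const float in[n + w - 1],
            const float weight[w], float out[n]) {
    for (int i = 0; i < n; i++) {       /* output positions */
        out[i] = 0.0f;
        for (int k = 0; k < w; k++)     /* filter taps */
            out[i] += weight[k] * in[i + k];
    }
}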
- Paper: Bounded Stream Scheduling in Polyhedral OpenStream
Nuno Miguel Nobre, Andi Drebes, Graham Riley and Antoniu Pop
[Abstract]
We consider OpenStream, a streaming dataflow language which supports the specification of concurrent tasks that communicate through streams. Streams, in the spirit of classical process networks, have no restrictions on their size. In order to deploy an OpenStream program on a chip, however, the size of the streams has to be bounded. This constricts the range of runtime behavior by restricting the schedules to a subset of parallel executions where the required memory never surpasses the available resources. In this paper, we exploit an approach that, conservatively, certifies that augmenting the intrinsic dataflow dependencies of the program with stream bounding constraints does not deadlock the program: it cannot show the existence of a deadlock but can give a certificate for the absence thereof. The aim of this work is to study the limitations of this stream bounding strategy and to demonstrate how it can currently be used to determine if an OpenStream program can execute under the particular memory constraints of a given architecture.
[Paper]
[Slides]
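The constraint added by bounding a stream can be stated in a few lines of plain C (our sketch, not OpenStream syntax; all names are ours): with a capacity of BOUND items, the production of item i must wait until item i - BOUND has been consumed, and these are exactly the extra dependences whose addition the paper certifies as deadlock-free.

#include <stdio.h>

#define BOUND 4     /* stream capacity: at most 4 unconsumed items */
#define ITEMS 10

int main(void) {
    int buf[BOUND];
    int produced = 0, consumed = 0;
    while (consumed < ITEMS) {
        /* produce while the bound allows: produced - consumed < BOUND */
        while (produced < ITEMS && produced - consumed < BOUND) {
            buf[produced % BOUND] = produced * produced;
            produced++;
        }
        /* consume one item, freeing a slot for the producer */
        printf("item %d -> %d\n", consumed, buf[consumed % BOUND]);
        consumed++;
    }
    return 0;
}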
Break (13:00 — 14:00)
Session 2: Tool Demonstration and Short Papers (14:00 — 14:30)
- Tool Demonstration: Farkas Lemma made easy
Christophe Alias
[Abstract]
In this paper, we present FKCC, a scripting tool to prototype program analyses and transformations exploiting the affine form of Farkas' lemma. Our language is general enough to prototype in a few lines sophisticated termination and scheduling algorithms. The tool is freely available and may be tried online via a web interface. We believe that FKCC is the missing link to accelerate the development of program analyses and transformations exploiting the affine form of Farkas' lemma.
[Short paper]
[Slides]
[Demo]
[Tool Demonstrator]
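For readers unfamiliar with the underlying result, the affine form of Farkas' lemma that FKCC scripts around can be stated as follows (standard material, not FKCC syntax):

% Affine form of Farkas' lemma: an affine function is nonnegative over a
% nonempty polyhedron P = { x | Ax + b >= 0 } iff it is a nonnegative
% affine combination of P's constraints. Equating the coefficients of x
% on both sides turns scheduling and termination conditions into linear
% constraints on (lambda_0, lambda).
\[
  \forall x \in \mathcal{P} :\ c^{\top}x + c_0 \ge 0
  \iff
  \exists \lambda_0 \ge 0,\ \lambda \ge 0 \ \text{such that}\
  c^{\top}x + c_0 \equiv \lambda_0 + \lambda^{\top}(Ax + b),
\]
where \(\mathcal{P} = \{ x \mid Ax + b \ge 0 \}\) is assumed nonempty.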
- Short paper: md_poly: A Performance-Portable Polyhedral Compiler based on Multi-Dimensional Homomorphisms
Ari Rasch, Richard Schulze and Sergei Gorlatch
[Abstract]
Polyhedral compilers automatically parallelize sequential programs for multi- and many-core architectures, such as CPUs and GPUs. However, parallel code generated by state-of-the-art polyhedral compilers often lacks performance portability, because the existing compilers are usually optimized toward a single target architecture only (e.g., GPU). Moreover, even on their target architecture, polyhedral compilers sometimes fail to reach high performance, because they often miss important optimizations, e.g., efficiently exploiting fast memory resources.
We present our work-in-progress results for md_poly, a novel polyhedral compiler that generates portable high-performance code from sequential C programs with perfect loop nests and rectangular iteration domains. In contrast to existing polyhedral compilers, md_poly relies on the code generation approach for Multi-Dimensional Homomorphisms (MDHs): we show that the internal program representation of polyhedral compilers (a.k.a. the polyhedral model) can be automatically transformed into an equivalent MDH representation; this representation is suitable for generating high-performance program code that is performance portable over different architectures. Our preliminary experimental comparison against PPCG with two benchmarks, Gaussian Convolution and Matrix Multiplication, shows encouraging results: speedups of up to 7× on an Intel CPU and 3× on an NVIDIA GPU for real-world input sizes from deep learning.
[Short paper]
[Slides]
Session 3: Theoretical advances (14:30 — 15:30)
- Paper: Scalable Polyhedral Compilation, Syntax vs. Semantics: 1-0 in the First Round
Riyadh Baghdadi and Albert Cohen
[Abstract]
The development of lightweight polyhedral compilation algorithms opens polyhedral loop transformation, parallelization and code generation to a larger class of programs. The Pluto scheduling algorithm plays a major role in state-of-the-art polyhedral compilers, aiming for the simultaneous enhancement of locality and the exploitation of coarse-grain parallelism through loop tiling. Reducing the run time of affine scheduling algorithms like Pluto has a significant impact on the overall compilation time of polyhedral compilers. Several approaches have been proposed to reduce the run time of affine scheduling while preserving most of the optimization opportunities. Yet these works have taken separate rather than consolidated attempts at the problem. In an attempt to better characterize the potential and limitations of such approaches, we introduce and evaluate a family of techniques called offline statement clustering. Program statements are clustered into macro-statements and the dependence graph is projected onto these macro-statements before affine scheduling. Offline statement clustering integrates transparently into the flow of a state-of-the-art polyhedral compiler and can reduce the scheduling time by a factor of 6 (median) without inducing a significant loss in optimization opportunities. We also study the theoretical and experimental properties of statement clustering, shedding new light on the leading syntax-driven heuristic. Our work-in-progress study confirms the surprising finding that the simpler, apparently more fragile and syntax-dependent methods tend to perform well on a wide range of benchmarks.
[Paper]
[Slides]
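A minimal example of the clustering idea (ours, not from the paper): in the C kernel below, S1 and S2 share the same enclosing loops and iteration domain, so a syntax-driven heuristic can fuse them into a single macro-statement, and the scheduler then sees one statement instead of two.

/* Illustrative only: S1 and S2 are clustered into one macro-statement M;
   the dependence graph is projected onto M before affine scheduling. */
void kernel(int n, double A[n][n], double B[n][n], double C[n][n],
            const double u[n], const double v[n], double beta) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            B[i][j] = A[i][j] + u[i] * v[j];   /* S1: part of M */
            C[i][j] = beta * B[i][j];          /* S2: part of M */
        }
}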
- Paper: Uniform Random Sampling in Polyhedra
Benoît Meister and Philippe Clauss
[Abstract]
We propose a method for generating uniform samples from a domain of integer points defined by a polyhedron in a multi-dimensional space. The method extends to domains defined by parametric polyhedra, in which a subset of the variables are symbolic. We motivate this work with a list of applications of the method in computer science. The proposed method relies on polyhedral ranking functions, as well as on a recent inversion method for them, named trahrhe expressions.
[Paper]
[Slides]
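The ranking-function idea is easy to see on a small, fixed polyhedron (our sketch; the paper's trahrhe expressions generalize the inversion to arbitrary parametric polyhedra): rank every integer point, draw a uniform rank, and invert.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

/* Illustrative only: uniform sampling of the triangle
   { (i, j) : 0 <= j <= i < n }. The ranking function
   rank(i, j) = i*(i+1)/2 + j is a bijection onto [0, card),
   and it admits the closed-form inverse computed below. */
int main(void) {
    const long n = 100;
    const long card = n * (n + 1) / 2;   /* number of integer points */
    srand(42);
    for (int s = 0; s < 5; s++) {
        long k = rand() % card;          /* uniform rank (modulo bias ignored) */
        long i = (long)((sqrt(8.0 * (double)k + 1.0) - 1.0) / 2.0);
        while (i * (i + 1) / 2 > k) i--;           /* guard against fp rounding */
        while ((i + 1) * (i + 2) / 2 <= k) i++;
        long j = k - i * (i + 1) / 2;
        printf("rank %ld -> point (%ld, %ld)\n", k, i, j);
    }
    return 0;
}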
Break (15:30 — 16:00)
Session 4: Issues on loop parallelization (16:00 — 17:00)
- Paper: Pipelined Multithreading Generation in a Polyhedral Compiler
Harenome Razanajato, Cédric Bastoul and Vincent Loechner
[Abstract]
State-of-the-art automatic polyhedral parallelizers extract and express parallelism as isolated parallel loops. For example, the Pluto high-level compiler generates and annotates loops with “#pragma omp parallel for” directives. Our goal is to take advantage of pipelined multithreading, a parallelization strategy that addresses a wider class of codes, currently not handled by automatic parallelizers.
Pipelined multithreading requires interlacing the iterations of some loops in a controlled way that enables the parallel execution of these iterations. We achieve this using OpenMP clauses such as ordered and nowait. The sketch of our method is to: (1) schedule a SCoP using traditional techniques such as Pluto’s algorithm; (2) detect potential pipelines in groups of sequential loops; (3) fine-tune the schedule; and (4) generate the resulting code.
The fully automatic generation is ongoing work, yet we show on a small set of experiments how pipelined multithreading makes it possible to parallelize programs that would otherwise not be parallelized.
[Paper]
[Slides]
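The flavor of this parallelism can be sketched by hand in a few lines of OpenMP (ours, not the tool's output): with schedule(static, 1) interlacing iterations across threads and an ordered region serializing the dependent part, stage1 of iteration i+1 overlaps with stage2 of iteration i.

#include <stdio.h>

#define N 16

/* Hypothetical stages: stage1 is the parallel part of the body,
   stage2 carries a loop dependence and must run in iteration order. */
static int stage1(int i) { return i * i; }
static void stage2(int i, int v) { printf("iter %d -> %d\n", i, v); }

int main(void) {
    #pragma omp parallel
    {
        #pragma omp for ordered schedule(static, 1)
        for (int i = 0; i < N; i++) {
            int v = stage1(i);    /* executes in parallel */
            #pragma omp ordered
            stage2(i, v);         /* serialized, in iteration order */
        }
    }
    return 0;
}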
- Paper: Static versus Dynamic Memory Allocation: a Comparison for Linear Algebra Kernels
Toufik Baroudi, Vincent Loechner and Rachid Seghir
[Abstract]
The polyhedral model makes it possible to automatically improve data locality and enable parallelism in regular linear algebra kernels. In previous work we proposed a new data structure, the 2d-packed layout, to store only the non-zero elements of regular sparse (triangular and banded) matrices, dynamically allocated, for different basic linear algebra operations, and used Pluto to parallelize and optimize them. To our surprise, there were huge discrepancies in our measurements of these kernels’ execution times that were due to the allocation mode: as statically declared arrays or as dynamically allocated arrays of pointers.
In this paper we compare the performance of various linear algebra kernels, including some from the PolyBench suite, using different array allocation modes. We present a detailed investigation of the possible reasons for the performance variations on two different architectures: a dual 12-core AMD (Magny-Cours) and a dual 10-core Intel Xeon (Haswell-EP).
We conclude that static or dynamic memory allocation has an impact on performance in many cases, and that the processor architecture and the gcc compiler’s decisions can cause significant and sometimes surprising variations in favor of one or the other allocation mode.
[Paper]
[Slides]
[Code]
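For reference, the two allocation modes compared in the paper look as follows in C (a minimal sketch, not the paper's benchmark code):

#include <stdlib.h>

#define N 1024

/* Mode 1: statically declared array -- one contiguous block whose full
   2-D type is visible to the compiler, so index arithmetic and aliasing
   can be resolved at compile time. */
static double A_static[N][N];

int main(void) {
    /* Mode 2: dynamically allocated array of pointers -- one extra
       indirection per row, and rows need not be contiguous. */
    double **A_dyn = malloc(N * sizeof *A_dyn);
    for (int i = 0; i < N; i++)
        A_dyn[i] = malloc(N * sizeof **A_dyn);

    A_static[1][2] = 3.0;   /* direct address computation */
    A_dyn[1][2] = 3.0;      /* load row pointer, then index */

    for (int i = 0; i < N; i++)
        free(A_dyn[i]);
    free(A_dyn);
    return 0;
}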
Important Dates
Abstract deadline: November 3, 2019 (AoE) (extended from October 27, 2019)
Paper deadline: November 10, 2019 (AoE) (extended from November 3, 2019)
Notification of decision: December 5, 2019 (originally December 2, 2019)
Final version due: December 20, 2019 (originally December 15, 2019)
Workshop: January 22, 2020
Call for Papers
We welcome both theoretical and experimental papers on all aspects of
polyhedral compilation and optimization. We also welcome submissions
describing preliminary results, crazy new ideas, position papers,
experience reports, and available tools, with an aim to stimulate
discussions, collaborations, and advances in the field. The following
illustrate potential IMPACT papers:
- Thorough theoretical discussion of a preliminary idea with an attempt to
place it in context but no experimental results.
- Experimental results comparing two or more existing ideas, followed by a
detailed analysis.
- Presentation of an existing idea in a different way, including
illustrations of how the idea applies to new use cases, code,
architectures, etc. Attribution should be as clear as possible.
Topics of interest include, but are not limited to:
- program optimization (automatic parallelization, tiling, etc.);
- code generation;
- data/communication management on GPUs, accelerators and distributed systems;
- hardware/high-level synthesis for affine programs;
- static analysis;
- program verification;
- model checking;
- theoretical foundations of the polyhedral model;
- extensions of the polyhedral model;
- scalability and robustness of polyhedral compilation techniques.
Submissions
Submissions should not exceed 10 pages (recommended 8 pages), excluding
references, formatted as per ACM SIGPLAN proceedings format. Please use
version 1.54 or above of the following template to prepare your
manuscript: the ACM
Proceedings Template. Make sure to use the sigplan
subformat. Visit the SIGPLAN
Author Resources page for further information on SIGPLAN manuscript formatting.
NOTE: older versions of the article template use smaller fonts for
submission; please double-check that you are using a recent style file.
In particular, various LaTeX distributions ship older versions of the
acmart style file; please download the most recent one from ACM.
Submissions should use PDF format and be printable on US Letter or A4
paper. Please submit your manuscripts through EasyChair:
https://easychair.org/conferences/?conf=impact2020.
Proceedings will be posted online. If the final version of an accepted
paper does not sufficiently address the comments of the reviewers, then
it may be accompanied by a note from the program committee. Publication
at IMPACT will not prevent later publication of the presented work in
conferences or journals. However, simultaneous submission to IMPACT and
another workshop, conference, or journal is often prohibited by the
policies of those venues. For instance, a manuscript overlapping
significantly with an IMPACT submission cannot be submitted to PLDI 2020
or any other overlapping SIGPLAN event.
Organizers
Philippe Clauss (University of Strasbourg, Inria, France)
Guillaume Iooss (ENS Paris, Inria, France)
Program Committee
Paul Feautrier (ENS Paris, Inria, France)
Cédric Bastoul (University of Strasbourg, France)
Albert Cohen (Google, France)
Sanjay Rajopadhye (Colorado State University, USA)
Ponnuswamy Sadayappan (University of Utah, USA)
François Irigoin (Mines ParisTech, France)
Paul Kelly (Imperial College London, UK)
Louis-Noël Pouchet (Colorado State University, USA)
Sven Verdoolaege (Cerebras Systems, Belgium)
Benoît Meister (Reservoir Labs, USA)
Vincent Loechner (University of Strasbourg, France)
Tomofumi Yuki (University of Rennes/IRISA, Inria, France)
David Wonnacott (Haverford College, USA)
Tobias Grosser (ETH Zurich, Switzerland)
Oleksandr Zinenko (Google, France)
Aravind Sukumaran-Rajam (Washington State University, USA)
Benoît Pradelle (Silexica, Germany)
Sriram Krishnamoorthy (Pacific Northwest National Lab, USA)
Contact us
Please send any questions to Philippe Clauss and Guillaume Iooss at
impact-chairs@lists.gforge.inria.fr.
Sponsors