Tuesday at 10:30 a.m.

Salle 465 (PCRI-N), Université Paris-Sud 11, Bât 650 Ada Lovelace, Rue Noetzlin, 91190 Gif-sur-Yvette. How to find us?

To receive the seminar announcements by email, please subscribe to the mailing list by either

- sending an email to seminaire-parsys-request@lri.fr with the subject line "subscribe" or
- contacting the seminar organizer Laércio Lima Pilla.

Distributed proofs are mechanisms enabling the nodes of a network to collectively and efficiently check the correctness of Boolean predicates on the structure of the network (e.g. having a specific diameter), or on data-structures distributed over the nodes (e.g. a spanning tree). We show that many network predicates have distributed proofs offering a high level of redundancy, explicitly or implicitly. We use this remarkable property of distributed proofs for establishing perfect tradeoffs between the size of the certificate stored at every node, and the number of rounds of the verification protocol. Joint work with Laurent Feuilloley, Pierre Fraigniaud, Juho Hirvone and Mor Perry.

Resilience represents one of the many challenges to be anticipated and addressed before viable exascale systems can emerge. In this context, I will present an approach to tackle resilience for the solving of deterministic and stochastic partial differential equations (PDEs). The approach is based on a domain decomposition framework, together with a sampling and optimization approach. In particular, the occurence of silent data corruption (SDC), modeled by random bit flips in the 64-bits representation of double precision variables, is overcome by means of robust regression techniques involving l1-minimization. In the case of an elliptic PDE with random coefficients, the stochastic quantities are approximated by a polynomial chaos (PC) surrogate and we then resort to a hybrid approach involving both sampling and Galerkin projection. Finally, I will present discrete a priori bounds on the solution of elliptic PDEs, used to enhance the resilient capabilities of the approach. This a priori knowledge is used to check the admissibility of local PDE solutions and to filter out corrupted data, thus making the subsequent regression more robust.

Technological advancements in integrated circuits and computer organization have been leading to better high performance computing (HPC) systems at the cost of more complex designs including specialized networks, hierarchical and distributed shared memory, accelerators with their own complex subsystems, among others. In these systems, achieving high performance and reliability have been important objectives for a while, and energy efficiency has also become an issue in recent years. With these objectives in mind, I have been working with collaborators in different scenarios by trying to understand the behavior of applications and the architecture of the computing systems where they execute, and by combining this information to identify issues and propose new solutions. In this presentation, I will focus on some of my experience with the development of machine topology-aware global scheduling algorithms, and I will discuss future plans related to hybrid-memory systems.

We consider the problem of delivery of a packet from a source node to a destination node in a graph, by a team of mobile robots that have very low battery capacity (two units of energy, with one unit spent each time an agent traverses an edge). The robots can collaborate to deliver the packet by exchanging energy or the packet itself whenever two of them are co-located. We study two variants of the problem: In the fixed placement variant, the robots are initially placed at some given nodes of the graph and the question is whether delivery is feasible. We prove that this problem is NP-complete. In the free placement variant, the initial positions of the robots are not fixed but a subset of nodes H of the graph is given as input together with an integer k, and the question is whether there exists a placement of k robots at nodes in H such that delivery is possible. We prove that this problem can be solved in polynomial time, by actually computing the minimum initial energy and the placement of robots at nodes in H that allows them to carry out the delivery.

(joint work with Shantanu Das, Dariusz Dereniowski, and Christina Karousatou)

Avec l’émergence du calcul haute-performance (HPC) et des applications Big Data, de nouvelles problématiques cruciales sont apparues. Parmi elles on trouve le problème du transfert de données, c’est-à-dire des communications entre machines, qui peut générer des délais lors de gros calculs en plus d’avoir un impact sur la consommation énergétique. Dans cette présentation nous nous intéresserons à la réduction des communications pour un problème particulier : le produit de matrices. Dans un premier temps nous nous intéressons à eux modélisation théoriques, basées respectivement sur le partitionnement d'un carré et d'un cube, ainsi qu'au algorithmes d’approximations existant pour résoudre ce problème. Dans un second temps nous nous intéresserons à la mise en application de ces algorithmes avec une implémentation pratique du produit de matrice sur une plate-forme hétérogène.

Tensors, or multi-dimensional arrays, have been increasingly used in the recent past in many application domains including signal processing, quantum chemistry, data analysis and machine learning. Tensor decompositions, generalization of matrix decompositions such as SVD to higher dimensions, are employed to find a low-rank representation of data in these applications. This in turn enables finding feasible solutions to the problem at hand with a proper interpretation of this compact representation. Specifically, sparse tensor decompositions are used in data analysis involving recommender systems, graph analytics, and anomaly detection in order to predict missing entries in the tensor, while dense tensor decompositions are heavily employed in signal processing for the detection of unknown signal sources. Recently, another promising use case of tensors have arised in the solution of linear systems and eigenvalue problems in higher dimensional problems where matrices and vectors can be expressed using low-rank tensor decompositions, and all matrix-vector operations can be carried out in this compressed form with tremendous computational and memory gains. In all these applications, computing tensor decompositions efficiently is indispensible for rendering tensor methods practical when dealing with data of massive scale. The focus of this talk is challenges encountered in accelerating the computation of sparse and dense tensor decompositions using effective shared/distributed memory parallelization, partitioning, data structures, and algorithms.

Finding a good partition of a computational directed acyclic graph associated with an algorithm can help find an execution pattern improving data locality, conduct an analysis of data movement, and expose parallel steps. The partition is required to be acyclic , i.e., the inter-part edges between the vertices from different parts should preserve an acyclic dependency structure among the parts. In this work, we adopt the multilevel approach with coarsening, initial partitioning, and refinement phases for acyclic partitioning of directed acyclic graphs and develop a direct k -way partitioning scheme. To the best of our knowledge, no such scheme exists in the literature. To ensure the acyclicity of the partition at all times, we propose novel and efficient coarsening and refinement heuristics. The quality of the computed acyclic partitions is assessed by computing the edge cut, the total volume of communication between the parts, and the critical path latencies. We use the solution returned by well-known undirected graph partitioners as a baseline to evaluate our acyclic partitioner, knowing that the space of solution is more restricted in our problem.

This paper presents Legolas++ arrays, a C++ multi-dimensional array library. Parameterized type of Legolas++ arrays enable data layout adaptation for specific Single Instruction Multiple Data (SIMD) core architectures. The mapping of complex array-based kernels to regular collections of data is efficiently vectorized. In addition, Legolas++ arrays can combine multi-threaded parallelism with SIMD acceleration. For example, a direct tridiagonal solver applied to a collection of equally sized problems exhibits a speedup of more than x22 on an 8-core SIMD processor.

In a distributed system the nodes affected by arbitrarily failures are modeled as Byzantine. In this model the set of Byzantine processes never changes during the time contrarily to what happens in a long lasting execution, where nodes can be affected and restored. The Mobile Byzantine Model aims to capture such kind of scenario. In this talk we present the different Mobile Byzantine Failures models that have been defined so far and the problems, Consensus, Approximate Agreement and Distributed Registers, that have been solved in such models. Besides that, we are going to underline the challenges and impossibilities that arises when we move from the Byzantine to the Mobile Byzantine Failures model.

Suppose to function correctly, a chemical system requires a single molecule of a certain species L. Preparing a solution with just a single copy of L is a difficult task to achieve with imprecise pipettors. Could we engineer artificial reactions (a chemical election algorithm, so to speak) that whittle down an initially large count of L to 1? Yes, with the reaction L+L→L+F: whenever two candidate leaders encounter each other, one drops out of the race. In volume v convergence to a single L requires expected time proportional to v; the final reaction --- two lone L's seeking each other in the vast expanse of volume v --- dominates the whole expected time.

One might hope that more cleverly designed reactions could elect a leader more quickly. We dash this hope: L+L→L+F, despite its sloth, is the fastest chemical algorithm for leader election there is (subject to some reasonable constraints on the reactions). The techniques generalize to establish lower bounds on the time required to do other distributed computing tasks, such as computing which of two species X or Y holds an initial majority, whether the initial number of X's is odd, or cutting the total number of X's in half.

Cet exposé porte sur l'optimisation d'algorithmes et leur implémentation sur carte graphique (GPU).

Après une introduction rappelant différents points cruciaux de la programmation GPU, l'exposé s'axe sur la réalisation de réductions vectorielles sur carte graphique. En se basant sur le cas des réductions de vecteur, l'exposé traite la problématique de la factorisation d'un grand nombre de matrices symétriques définies positives. Dans cette partie sont présentées différentes optimisations en accord avec les règles de la programmation sur GPU développées en introduction.

The seminar will start 15 minutes late, at **10:45 a.m.**

Ces dernières années deux grandes familles d’accélérateurs matériels ont été très répandues : les GPU (principalement de Nvidia) et les Xeon-phi (d’Intel). Les GPUs requièrent une algorithmique et une programmation spécifique alors que les Xeon-phi ne nécessitent théoriquement que des développements multithreads et vectoriels comme les CPU Xeon multi-cœurs pourvues d’unités (vectorielles) AVX.

Afin d’évaluer les performances mais aussi la difficulté de programmation optimisée de ces trois architectures (CPU, GPU et Xeon-phi), nous avons développé des applications pour les trois. Nous nous sommes notamment attardés sur la vectorisation des codes CPU et Xeon-phi à travers des codes sources adaptés et enrichis de directives de compilation.

Dans le but d’utiliser toute la puissance d’un nœud de calcul, nous avons également développé des solutions répartissant les calculs sur un CPU multi-cœurs et sur un accélérateur, ou même sur un CPU et deux accélérateurs différents en même temps (GPU et Xeon-phi). Dans tous les cas nous avons veillé à recouvrir au maximum les calculs et les transferts de données entre la mémoire CPU et les mémoires des accélérateurs, tout en recherchant une symétrie algorithmique maximale entre l’utilisation d’un GPU et celle d’un Xeon-phi.

Plus récemment, nous avons participé au développement d’un code CPU bound de Galerkin Discontinu 2D sur CPU, GPU et Xeon-phi, et avons pu mesurer et comparer les performances et les efficacités énergétiques de Xeon-phi KNC et KNL, de GPU GeForce Titan et Tesla, et de CPU multi-cœurs.

Cet exposé présentera nos solutions algorithmiques multi-devices et fera un retour d’expérience sur nos efforts de développements. Ces travaux ont été réalisés grâce à des collaborations avec Sylvain Contassot-Vivier du LORIA, Juliet Ryan et Ludomir Oteski de l’ONERA, et Guillaume Colin de Verdière du CEA DAM/DIF.

Exploring the use of Graphic Processor Unit (GPGPU) as a general purpose co-processor to increase the computational power has been an active research field since the release of the CUDA language in 2007. GPUs have now become an integrated component in devices like smartphones, games console or even modern CPUs. Integrated GPUs are an essential part of these chipsets as performance and energy efficiency are the main concerns. In this talk, we will discuss the programming model of integrated graphics and their advantages as we now have the newest NVIDIA Tegra system-on-chip available for the ParSys team.

The Knights Landing is the 2^nd generation of Intel Xeon Phi product. This new processor provides significant enhancement in performance compared to the previous generation. Among the new features of this self-booting CPU, we can name the binary compatibility with Xeon, power efficiency of the cores, new 512-bit vector instruction set, on-package high-bandwidth memory and integrated fabric. For this presentation, we will overview the main hardware characteristics of Knights Landing processors and their impact on the way we write code.

Because of locality of communication in space, a computing medium such as a Cellular Automata (CA) can scale abitrarily in size, thus providing a machine which power can grow arbitrarily. This same locality , however, makes the programming difficult, and -CA as a computing device- ... look more like a toy for mathematicians fond of funny images. They are mostly known for the game of life rule, which allthough universal, is light-years away from real practical computations. We will show that a CA rule can be real General Purpose, and even an efficient computing device. We use two main ingredient: 1- we have develop a compiler and a simulation platform allowing to program complex CA with radius of 20 and thousands of gates, using layers, and object oriented software, and an efficient execution exploiting the SIMD integer operations. 2- we obtain general purpose-ability by implementing SELF DEVELOPING SYSTEM.

The GPCA rule is not finalised yet, allthough we will be able to present several simulations, showing the feasibility. First, we show a CA rule able to do homogeneization of agents distributed on the CA medium. When the agents are hierachically organized, the homogeneisation still work on the hierarchy. Second we present the current status of our GPCA. It executes instructions send by a host, which propage through the medium like a wave. The first kind of instructions are local instructions: the effect is immediate as the instructions traverse a CA cell. Local instructions allow to select agents, and set registers. The second kind of instructions call collective instructions, define regions on the CA, in which a sustained activy takes blace by a succession of 4- 5 region-waves giving birth to each other, at specific places. They are the equivalent of the 4 - 5 cycles, needed to execute a machine instruction in a processor. Those instruction are micro-programmed, the micro-instructions are associated to each region wave. For the moment being we have implemented only 3 collective instructions, the easiest one: Election, Computation, and Duplication. We will show a development obtained by sending 8 Duplications to an initially single agent. It generates 256 offsprings, in only 600 iterations of the CA rules, which illustrates the scalability. The finalised GPCA will make use of 4 more elaborate instructions, whose functionning is described in detail in a report, but not yet implemented. The GPCA cell will have around 64 bits of memory, and 20,000 gates. For the moment, it has only 6000 gates.

We are used to think of computer as General-Purpose machine with arbitrary large memory. The GPCA is a General-Purpose machine with arbitrary large power.

In this work we investigate the design of task-based sparse direct solvers which constitute extremely irregular workloads, with tasks of different granularities and characteristics with variable memory consumption on top of runtime systems. We prove the usability and effectiveness of our approach with the implementation of a sparse matrix multifrontal factorization based on a Sequential Task Flow parallel programming model. Using this programming model, we developed features such as the integration of dense 2D Communication Avoiding algorithms in the multifrontal method allowing for better scalability compared to the original approach used in the qr_mumps solver. Following this approach, we address heterogeneous architectures where task granularity and scheduling strategies are critical to achieve performance. We present, for the multifrontal method, a hierarchical strategy for data partitioning and a scheduling algorithm capable of handling the heterogeneity of resources.

In this paper, we study the fundamental problem of counting, which consists in computing the size of a system. We consider the distributed communication model of population protocols of finite state, anonymous and asynchronous mobile devices (agents) communicating in pairs (according to a fairness condition). This work significantly improves the previous results known for counting in this model, in terms of (exact) space complexity. We present and prove correct the first space optimal protocols solving the problem for two classical types of fairness, global and weak. Both protocols require no initialization of the counted agents.

The protocol designed for global fairness, surprisingly, uses only one bit of memory (two states) per counted agent. The protocol, functioning under weak fairness, requires the necessary log P bits (P states, per counted agent) to be able to count up to P agents. Interestingly, this protocol exploits the intriguing Gros sequence of natural numbers, which is also used in the solutions to the Chinese Rings and the Hanoi Towers puzzles.

Computers are used in a wide range of applications, including embedded control systems in cars and aircraft, cloud computing, and the Internet. As we increasingly depend on the correct operation of such systems, it becomes crucial to design them in a way that ensures that they behave as expected. To do so, one has to address two problem areas: on the one hand, partial failure that is outside the control of a system designer (such as power outages or bit-flips due to radiation), and on the other hand, design faults (bugs). The former is classically addressed by means of replication and fault-tolerant distributed algorithms, while the latter is dealt with by rigorous software engineering methods, such as model checking. In order to maximize the reliability, one should deploy fault-tolerant distributed algorithms that have been verified by model checking. However, only very few distributed algorithms have been automatically verified, because many aspects of distributed algorithms, such as parameterization, faults, uncertain timing, etc. still pose research challenges for model checking.

In this talk I will discuss several verification techniques that we recently generalized and used for automatic verification of distributed algorithms. These techniques include parametric interval data and counter abstraction, offline partial order reductions, and acceleration.

Based on joint work with Annu Gmeiner, Igor Konnov, Helmut Veith, and Ulrich Schmid

Sparse triangular solvers are typically parallelized using level-scheduling techniques, but parallel efficiency is poor on high-throughput architectures like GPUs. This talk proposes an iterative approach for solving sparse triangular systems when an approximation is sufficient. Although not suitable for all problems, the approach can be successful for sparse triangular matrices arising from incomplete factorizations, where an approximate solution is acceptable. Significant performance gains can be achieved when using this approach for a preconditioned Krylov subspace method.

We investigate the approximate consensus problem in highly dynamic networks in which topology may change continually and unpredictably. We prove that in both synchronous and partially synchronous networks, approximate consensus is solvable if and only if the communication graph in each round has a rooted spanning tree. Interestingly, the class of averaging algorithms, which have the benefit of being memoryless and requiring no process identifiers, entirely captures the solvability issue of approximate consensus in that the problem is solvable if and only if it can be solved using any averaging algorithm.

After this characterization of solvability of approximate consensus, we turn to the question of time complexity. We give an algorithm that solves approximate consensus in time linear in the numer of nodes in the network and show the optimality of its time complexity up to a multiplicative constant. We also present its generalization to higher dimensional values and the case of finite memory, in which rounding procedures will be necessary.

This is joint work with Bernadette Charron-Bost and Matthias Függer.

Functions of matrices are widely used in science, engineering and the social sciences, due to the succinct and insightful way they allow problems to be formulated and solutions to be expressed. New applications involving matrix functions are regularly being found, ranging from small but difficult problems in medicine to huge, sparse systems arising in the solution of partial differential equations. In this talk I will outline the history of matrix functions, summarize some applications, and describe recent developments concerning computation of the matrix exponential and logarithm and their Fréchet derivatives.

**Biography:** Nick Higham is Richardson Professor of Applied Mathematics in the School of Mathematics, University of Manchester. His degrees (BA 1982, MSc 1983, PhD 1985) are from the University of Manchester, and he has held visiting positions at Cornell University and the Institute for Mathematics and its Applications, University of Minnesota. He is Director of Research within the School of Mathematics, Head of the Numerical Analysis Group, and was Director of the Manchester Institute for Mathematical Sciences (MIMS) 2004-2010. He was elected Fellow of the Royal Society in 2007, is a SIAM Fellow, and held a Royal Society-Wolfson Research Merit Award (2003-2008).
He is well known for his research on the accuracy and stability of numerical algorithms, and the second edition of his 700-page monograph on this topic was published by SIAM in 2002. His most recent book, Functions of Matrices: Theory and Computation (SIAM, 2008), is the first research monograph devoted to this topic. He is the Editor of the Princeton Companion to Applied Mathematics (2015, circa 1000 pages).
He has more than 100 refereed publications on topics such as rounding error analysis, linear systems, least squares problems, matrix functions and nonlinear matrix equations, condition number estimation, and polynomial eigenvalue problems. His research has been supported by grants from EPSRC and by fellowships from the Nuffield Foundation, the Royal Society and the Leverhulme Trust. He currently holds a 2M euro ERC Advanced Grant (2011-2016) supporting his research on matrix functions.
Higham is a member of the editorial boards of the journals Acta Numerica, Forum of Mathematics, Foundations of Computational Mathematics. IMA Journal of Numerical Analysis, Linear Algebra and Its Applications, and Numerical Algorithms, He is also (Founding) Editor-in-Chief of the SIAM Fundamentals of Algorithms book series.
He was Vice President at Large (2010-2013) of SIAM and has served for over ten years on the SIAM Board of Trustees and the SIAM Council. He has also served on the Board of Directors of the International Linear Algebra Society, and as Chair of the SIAM Activity Group on Linear Algebra. He is a frequent invited speaker at international conferences, served for 17 years on the (permanent) organizing committee of the Householder Symposia on Numerical Linear Algebra, and was a member of the Scientific Program Committee for ICIAM 2011.
Higham has contributed software to LAPACK and the NAG library, and has written numerous M-files included in MATLAB. His algorithms are also included in Julia, SciPy, Mathematica and other packages.
Honours include the Alston S. Householder Award VI, 1987 (for the best Ph.D. thesis in numerical algebra 1984--1987), the 1988 Leslie Fox Prize in Numerical Analysis, a 1999 Junior Whitehead Prize from the London Mathematical Society, designation as a "Highly Cited Researcher" by Thomson/ISI in 2006, and the 2008 Fröhlich Prize of the London Mathematical Society.
Higham is also author of the best-selling SIAM books Handbook of Writing for the Mathematical Sciences (2nd edition, 1998) and MATLAB Guide (with D. J. Higham, 2nd edition, 2005), and is a contributor to the popular Penguin Dictionary of Mathematics (fourth edition, 2008).

With the increased failure rate expected in future extreme scale supercomputers, process replication might become a viable alternative to checkpointing. By default, the workload efficiency of replication is limited to 50% because of the additional resources that have to be used to execute the replicas of the application's processes. In this talk, I will introduce the intra-parallelization technique, a solution that avoids replicating all computation by introducing work-sharing between replicas. I will also present SDR-MPI, our optimized implementation of active replication for MPI applications. Experiments on a representative set of benchmarks show that intra-parallelization together with SDR-MPI allow achieving more than 50% efficiency without compromising fault tolerance.

Currently, General purpose (GP)GPU programming is a popular solution to achieve high performance. It couples inexpensive highly parallel computing units with classic CPUs. These heterogenous systems lead to complex designs combining multiple paradigms and programming languages to manage each hardware architecture. This talk will present a set of tools to harness GPGPU programming through the OCaml programming language. It will describe the SPOC library, which handles GPGPU subprograms (kernels) and data transfers between devices. Then, It will show how SPOC expresses GPGPU kernels: through interoperability with common low-level extensions (from Cuda and OpenCL frameworks) but also via an embedded DSL forOCaml. Finally, I will present how to manage those kernels through parallel skeletons.

Après avoir rappelé les composantes de la puissance dissipée dans les processeurs et leur évolution avec les différents nœuds technologiques, nous montrons quels ont été les moteurs de l’évolution des performances avant l’apparition du « mur de la chaleur » puis après, avec l’arrêt de l’augmentation des fréquences d’horloge. Si l’évolution des architectures pour la haute performance est assez bien connue (multi-cœurs, many-cores, clusters), le large spectre des applications enfouies et embarquées conduit à des solutions différentes, avec notamment le retour vers des architectures de processeurs plus simples pour le bas de gamme.

La programmation parallèle est rendue obligatoire pour profiter des architectures actuelles, des smartphones aux super-calculateurs, car tous les processeurs contiennent des unités vectorielles et plusieurs coeurs. Certains outils permettent de paralléliser automatiquement du code mais les bibliothèques les plus optimisées utilisent généralement des outils de bas niveaux et des optimisations manuelles pour exploiter au mieux les capacités des processeurs et atteindre les meilleures performances.

Si les performances sont quantifiées, l'effort de développement nécessaire pour les atteindre en fonction des outils utilisés ne l'est généralement pas, de même que leur impact sur la consommation énergétique. Nous proposons de mesurer l'effort de développement à l'aide de métriques en fonction des approches de programmation parallèle utilisées et d'étudier le rapport entre effort de développement, performance et consommation énergétique.

The Iterated Immediate Snapshot model (IIS), due to its elegant geometrical representation, has become standard for applying topological reasoning to distributed computing. Its modular structure makes it easier to analyze than the more realistic (non-iterated) read-write Atomic-Snapshot memory model (AS).

In this talk we show that AS and IIS are equivalent with respect to distributed computability: a distributed function is computable in AS if and only if it is computable in IIS. Therefore, the computability of any distributed function in AS can be equivalently investigated in IIS. This is a joint work with Petr Kuznetsov and Eli Gafni to be presented at OPODIS 2014.

The growing complexity of modern computer architectures increasingly complicates the prediction of the run-time behavior of software. For real-time systems, where a safe estimation of the program’s worst-case execution time is needed, time-predictable computer architectures promise to resolve this problem. A stack cache, for instance, allows the compiler to efficiently cache a program’s stack, while static analysis of its behavior remains easy. Likewise, its implementation requires little hardware overhead.

This work introduces an optimization of the standard stack cache to avoid redundant spilling of the cache content to main memory, if the content was not modified in the meantime. At first sight, this appears to be an average-case optimization. Indeed, measurements show that the number of cache blocks spilled is reduced to about 17% and 30% in the mean, depending on the stack cache size. Furthermore, we show that lazy spilling can be analyzed with little extra effort, which benefits the worst-case spilling behavior that is relevant for a real-time system.

**Biography:** Florian Brandner is an Assistant Professor at ENSTA ParisTech working on the intersection of time-predictable computer architecture, static program analysis, worst-case execution time analysis, and compilers. He received is PhD in 2009 from the Vienna University of Technology, working on processor description languages and the automatic generation of machine-code generators. Then spend two years as a post-doc at INRIA / ENS de Lyon, working on register allocation. Before joining ENSTA, he worked within the FP7-STREP project T-CREST at the Technical University of Denmark on time-predictable computer architectures, networks-on-chip, static program analyses, and compilers for real-time systems.