Tuesday at 10:30 a.m.

Salle 465 (PCRI-N), Université Paris-Sud 11, Bât 650 Ada Lovelace, Rue Noetzlin, 91190 Gif-sur-Yvette. How to find us?

To receive the seminar announcements by email, please subscribe to the mailing list by either

- sending an email to seminaire-parsys-request@lri.fr with the subject line "subscribe" or
- contacting the seminar organizer Laércio Lima Pilla.

Load imbalance is a recurring problem in High Performance Computing (HPC), which leads to suboptimal performance via the under-use of available resources. As computing systems grow larger, resource management and load balancing become a costly process, especially for dynamic applications that demand periodical workload balance. With this in mind, we believe that future generation load balancing algorithms should look towards scaling along computing systems. In order to express solutions to the aforementioned issues, we propose a distributed scheduling model based on large-scale parallel machines and HPC systems. Additionally, we present a task packing model to minimize the decision costs of distributed algorithms, and present two scheduling strategies implemented in the Charm++ runtime system that use said model.

Atomic broadcast is a group communication primitive to order messages across a set of distributed processes. This primitive is used as a basic building block of many key online services today. Atomic multicast is its natural generalization where each message is addressed to a subset of the system nodes. This talk studies genuine solutions to this problem where only the sender and destinations of a message may take computation steps to deliver it. We show that the weakest failure detector to solve atomic multicast depends on the topology of the destination groups and present several possible candidates.

In this talk, I outline the notion of local checkability and point at various ways this has proven useful. Local checking (or local verification, or local decision, or local detection, or...) and locally checkable predicates (or languages, or labeling, or ...) were suggested (by Afek, K., and Yuvan, 1990) in the context of self-stabilization. The motivation was the "easy" detection that program malfunctions so that it can be restarted. It was contrasted with the "costly" (i.e. global) checking suggested by Katz and Perry.

It was pointed out that this notion also resembles "easy" checking (or decision, or ...), i.e. polynomial vs. "costly" computing. Indeed, the use of this notion has spread beyond self-stabilization to various fields such as (distributed) complexity theory, distributed testing and peer to peer computing.

This seminar aims at presenting the current research at LRI on quantum computing and quantum programming. This research concerns in particular the following topics:

- Optimization of the synthesis of quantum circuits (compilation phase) via methods and algorithms from high-performance scientific computing,

- Definition of a high-level language for quantum algorithms using formal methods.

2 PhD students and 2 postdocs will present their work via short presentations, accessible to non-specialists.

If you are interested in the subject, or just curious, you are very welcome to attend. Below is the program.

9:30-10:00: **Quantum circuit synthesis using linear algebra and optimization Algorithms**, by Timothée Goubault de Brugière (PhD student in ParSys team).

10:00-10:30: **Formalization of the semantics of quantum programming languages**, by Dong-Ho Lee (PhD student in VALS team).

10:30-11:00: **Introduction to quantum (walks)**, by Christopher Cedzich (Postdoc in ParSys team).

11:00-11:30: **A logic for recursive quantum programs**, by Zhaowei Xu (Postdoc in VALS team).

We will survey the relations between three classic models of distributed computing: the local model, the congest model, and the congested clique model. We will then outline two links between the congested clique model, and two other computational models. The first is a reduction from Boolean circuits, which sheds some light on the inability to prove lower bounds for the congested clique. The second is an adaptation of parallel matrix multiplication algorithms to the congested clique, a result which allows fast distance computation and triangle detection.

Asynchronous implementations present a mechanism to improve the parallel performance of iterative linear system solution methods on highly parallel computing platforms by removing the synchronization overhead. In this talk, after a brief overview of asynchronous computations, we consider a class of asynchronous iterative linear system solvers that employ randomization to determine the component update orders, specifically focusing on the effects of non-uniform distributions. Results show that using such distributions may lead to a faster convergence than that when components are selected for updates uniformly. Finally, we note on the performance of hybrid parallel implementations of asynchronous methods.

Escherichia coli (E. coli) is the most widely studied prokaryotic organism, and an important species served as host organism in microbiology. It is known that M13, a phage that infects E. coli, infection does not kill the host cell, and it leads researcher to study the dynamics of M13 infection of E. coli so that we understand and manipulate their cellular behavior. We develop a realistic computational model of E. coli infected with M13 phages, whose reactions are related to underlying biological mechanisms with measurable quantities. We match model predictions to experimental results, and the simulation result shows the growth of the infected E. coli and production of phages with corresponding phage coat proteins. Furthermore, we mention that the M13 infection model can be applied to implement a fault-tolerant cells communication model via M13 phages in a distributed approach.

Chance constrained optimization, also called probabilistic constrained optimization, is one of the main topics in stochastic optimization for dealing with random parameters in optimization problems. A chance constraint involving random parameters should hold with a prescribed probability which is in practice close to one. The use of chance constraints goes back to the late fifties. They have been widely applied in statistics, finance, engineering, energy planning etc. In this talk, we consider an n-player strategic game with a finite action set and random payoffs for each player. The payoff vector of each player follows a multivariate elliptically symmetric distribution. We assume that each player uses satisficing payoff criterion defined by a chance-constraint, i.e., players face a chance-constrained game. We show that there always exists a mixed strategy Nash equilibrium for this game. We also consider an n-player strategic game with random payoffs where the distribution of the payoff vector of each player is not completely known and belongs to a certain distributional uncertainty set. We formulate this game as a distributionally robust chance constrained game by defining a player's payoff function using a worst-case chance constraint. We show that there exists a mixed strategy Nash equilibrium for certain distributional uncertainty sets, and proposed an equivalent optimization problem for each case to compute the Nash equilibria.

We consider decision problems that are solved in a distributed fashion by synchronous mobile agents operating in an unknown, anonymous network. Each agent has a unique identifier and an input string and they have to decide collectively a property which may involve their input strings, the graph on which they are operating, and their particular starting positions. Building on recent work by Fraigniaud and Pelc [J. Parallel Distrib. Comput, vol. 109, pp. 117–128], we introduce several natural new computability classes allowing for a finer classification of problems below MAV or its complement class co - MAV, the former being the class of problems that are verifiable when the agents are provided with an appropriate certificate. We provide inclusion and separation results among all these classes. We also determine their closure properties with respect to set-theoretic operations. Our main technical tool, which may be of independent interest, is a new meta- protocol that enables the execution of a possibly infinite number of mobile agent protocols essentially in parallel, similarly to the well-known dovetailing technique from classical computability theory. (joint work with David Ilcinkas)

Distributed proofs are mechanisms enabling the nodes of a network to collectively and efficiently check the correctness of Boolean predicates on the structure of the network (e.g. having a specific diameter), or on data-structures distributed over the nodes (e.g. a spanning tree). We show that many network predicates have distributed proofs offering a high level of redundancy, explicitly or implicitly. We use this remarkable property of distributed proofs for establishing perfect tradeoffs between the size of the certificate stored at every node, and the number of rounds of the verification protocol. Joint work with Laurent Feuilloley, Pierre Fraigniaud, Juho Hirvone and Mor Perry.

Resilience represents one of the many challenges to be anticipated and addressed before viable exascale systems can emerge. In this context, I will present an approach to tackle resilience for the solving of deterministic and stochastic partial differential equations (PDEs). The approach is based on a domain decomposition framework, together with a sampling and optimization approach. In particular, the occurence of silent data corruption (SDC), modeled by random bit flips in the 64-bits representation of double precision variables, is overcome by means of robust regression techniques involving l1-minimization. In the case of an elliptic PDE with random coefficients, the stochastic quantities are approximated by a polynomial chaos (PC) surrogate and we then resort to a hybrid approach involving both sampling and Galerkin projection. Finally, I will present discrete a priori bounds on the solution of elliptic PDEs, used to enhance the resilient capabilities of the approach. This a priori knowledge is used to check the admissibility of local PDE solutions and to filter out corrupted data, thus making the subsequent regression more robust.

Technological advancements in integrated circuits and computer organization have been leading to better high performance computing (HPC) systems at the cost of more complex designs including specialized networks, hierarchical and distributed shared memory, accelerators with their own complex subsystems, among others. In these systems, achieving high performance and reliability have been important objectives for a while, and energy efficiency has also become an issue in recent years. With these objectives in mind, I have been working with collaborators in different scenarios by trying to understand the behavior of applications and the architecture of the computing systems where they execute, and by combining this information to identify issues and propose new solutions. In this presentation, I will focus on some of my experience with the development of machine topology-aware global scheduling algorithms, and I will discuss future plans related to hybrid-memory systems.

We consider the problem of delivery of a packet from a source node to a destination node in a graph, by a team of mobile robots that have very low battery capacity (two units of energy, with one unit spent each time an agent traverses an edge). The robots can collaborate to deliver the packet by exchanging energy or the packet itself whenever two of them are co-located. We study two variants of the problem: In the fixed placement variant, the robots are initially placed at some given nodes of the graph and the question is whether delivery is feasible. We prove that this problem is NP-complete. In the free placement variant, the initial positions of the robots are not fixed but a subset of nodes H of the graph is given as input together with an integer k, and the question is whether there exists a placement of k robots at nodes in H such that delivery is possible. We prove that this problem can be solved in polynomial time, by actually computing the minimum initial energy and the placement of robots at nodes in H that allows them to carry out the delivery.

(joint work with Shantanu Das, Dariusz Dereniowski, and Christina Karousatou)

Avec l’émergence du calcul haute-performance (HPC) et des applications Big Data, de nouvelles problématiques cruciales sont apparues. Parmi elles on trouve le problème du transfert de données, c’est-à-dire des communications entre machines, qui peut générer des délais lors de gros calculs en plus d’avoir un impact sur la consommation énergétique. Dans cette présentation nous nous intéresserons à la réduction des communications pour un problème particulier : le produit de matrices. Dans un premier temps nous nous intéressons à eux modélisation théoriques, basées respectivement sur le partitionnement d'un carré et d'un cube, ainsi qu'au algorithmes d’approximations existant pour résoudre ce problème. Dans un second temps nous nous intéresserons à la mise en application de ces algorithmes avec une implémentation pratique du produit de matrice sur une plate-forme hétérogène.

Tensors, or multi-dimensional arrays, have been increasingly used in the recent past in many application domains including signal processing, quantum chemistry, data analysis and machine learning. Tensor decompositions, generalization of matrix decompositions such as SVD to higher dimensions, are employed to find a low-rank representation of data in these applications. This in turn enables finding feasible solutions to the problem at hand with a proper interpretation of this compact representation. Specifically, sparse tensor decompositions are used in data analysis involving recommender systems, graph analytics, and anomaly detection in order to predict missing entries in the tensor, while dense tensor decompositions are heavily employed in signal processing for the detection of unknown signal sources. Recently, another promising use case of tensors have arised in the solution of linear systems and eigenvalue problems in higher dimensional problems where matrices and vectors can be expressed using low-rank tensor decompositions, and all matrix-vector operations can be carried out in this compressed form with tremendous computational and memory gains. In all these applications, computing tensor decompositions efficiently is indispensible for rendering tensor methods practical when dealing with data of massive scale. The focus of this talk is challenges encountered in accelerating the computation of sparse and dense tensor decompositions using effective shared/distributed memory parallelization, partitioning, data structures, and algorithms.

Finding a good partition of a computational directed acyclic graph associated with an algorithm can help find an execution pattern improving data locality, conduct an analysis of data movement, and expose parallel steps. The partition is required to be acyclic , i.e., the inter-part edges between the vertices from different parts should preserve an acyclic dependency structure among the parts. In this work, we adopt the multilevel approach with coarsening, initial partitioning, and refinement phases for acyclic partitioning of directed acyclic graphs and develop a direct k -way partitioning scheme. To the best of our knowledge, no such scheme exists in the literature. To ensure the acyclicity of the partition at all times, we propose novel and efficient coarsening and refinement heuristics. The quality of the computed acyclic partitions is assessed by computing the edge cut, the total volume of communication between the parts, and the critical path latencies. We use the solution returned by well-known undirected graph partitioners as a baseline to evaluate our acyclic partitioner, knowing that the space of solution is more restricted in our problem.

This paper presents Legolas++ arrays, a C++ multi-dimensional array library. Parameterized type of Legolas++ arrays enable data layout adaptation for specific Single Instruction Multiple Data (SIMD) core architectures. The mapping of complex array-based kernels to regular collections of data is efficiently vectorized. In addition, Legolas++ arrays can combine multi-threaded parallelism with SIMD acceleration. For example, a direct tridiagonal solver applied to a collection of equally sized problems exhibits a speedup of more than x22 on an 8-core SIMD processor.

In a distributed system the nodes affected by arbitrarily failures are modeled as Byzantine. In this model the set of Byzantine processes never changes during the time contrarily to what happens in a long lasting execution, where nodes can be affected and restored. The Mobile Byzantine Model aims to capture such kind of scenario. In this talk we present the different Mobile Byzantine Failures models that have been defined so far and the problems, Consensus, Approximate Agreement and Distributed Registers, that have been solved in such models. Besides that, we are going to underline the challenges and impossibilities that arises when we move from the Byzantine to the Mobile Byzantine Failures model.

Suppose to function correctly, a chemical system requires a single molecule of a certain species L. Preparing a solution with just a single copy of L is a difficult task to achieve with imprecise pipettors. Could we engineer artificial reactions (a chemical election algorithm, so to speak) that whittle down an initially large count of L to 1? Yes, with the reaction L+L→L+F: whenever two candidate leaders encounter each other, one drops out of the race. In volume v convergence to a single L requires expected time proportional to v; the final reaction --- two lone L's seeking each other in the vast expanse of volume v --- dominates the whole expected time.

One might hope that more cleverly designed reactions could elect a leader more quickly. We dash this hope: L+L→L+F, despite its sloth, is the fastest chemical algorithm for leader election there is (subject to some reasonable constraints on the reactions). The techniques generalize to establish lower bounds on the time required to do other distributed computing tasks, such as computing which of two species X or Y holds an initial majority, whether the initial number of X's is odd, or cutting the total number of X's in half.

Cet exposé porte sur l'optimisation d'algorithmes et leur implémentation sur carte graphique (GPU).

Après une introduction rappelant différents points cruciaux de la programmation GPU, l'exposé s'axe sur la réalisation de réductions vectorielles sur carte graphique. En se basant sur le cas des réductions de vecteur, l'exposé traite la problématique de la factorisation d'un grand nombre de matrices symétriques définies positives. Dans cette partie sont présentées différentes optimisations en accord avec les règles de la programmation sur GPU développées en introduction.

The seminar will start 15 minutes late, at **10:45 a.m.**

Ces dernières années deux grandes familles d’accélérateurs matériels ont été très répandues : les GPU (principalement de Nvidia) et les Xeon-phi (d’Intel). Les GPUs requièrent une algorithmique et une programmation spécifique alors que les Xeon-phi ne nécessitent théoriquement que des développements multithreads et vectoriels comme les CPU Xeon multi-cœurs pourvues d’unités (vectorielles) AVX.

Afin d’évaluer les performances mais aussi la difficulté de programmation optimisée de ces trois architectures (CPU, GPU et Xeon-phi), nous avons développé des applications pour les trois. Nous nous sommes notamment attardés sur la vectorisation des codes CPU et Xeon-phi à travers des codes sources adaptés et enrichis de directives de compilation.

Dans le but d’utiliser toute la puissance d’un nœud de calcul, nous avons également développé des solutions répartissant les calculs sur un CPU multi-cœurs et sur un accélérateur, ou même sur un CPU et deux accélérateurs différents en même temps (GPU et Xeon-phi). Dans tous les cas nous avons veillé à recouvrir au maximum les calculs et les transferts de données entre la mémoire CPU et les mémoires des accélérateurs, tout en recherchant une symétrie algorithmique maximale entre l’utilisation d’un GPU et celle d’un Xeon-phi.

Plus récemment, nous avons participé au développement d’un code CPU bound de Galerkin Discontinu 2D sur CPU, GPU et Xeon-phi, et avons pu mesurer et comparer les performances et les efficacités énergétiques de Xeon-phi KNC et KNL, de GPU GeForce Titan et Tesla, et de CPU multi-cœurs.

Cet exposé présentera nos solutions algorithmiques multi-devices et fera un retour d’expérience sur nos efforts de développements. Ces travaux ont été réalisés grâce à des collaborations avec Sylvain Contassot-Vivier du LORIA, Juliet Ryan et Ludomir Oteski de l’ONERA, et Guillaume Colin de Verdière du CEA DAM/DIF.

Exploring the use of Graphic Processor Unit (GPGPU) as a general purpose co-processor to increase the computational power has been an active research field since the release of the CUDA language in 2007. GPUs have now become an integrated component in devices like smartphones, games console or even modern CPUs. Integrated GPUs are an essential part of these chipsets as performance and energy efficiency are the main concerns. In this talk, we will discuss the programming model of integrated graphics and their advantages as we now have the newest NVIDIA Tegra system-on-chip available for the ParSys team.

The Knights Landing is the 2^nd generation of Intel Xeon Phi product. This new processor provides significant enhancement in performance compared to the previous generation. Among the new features of this self-booting CPU, we can name the binary compatibility with Xeon, power efficiency of the cores, new 512-bit vector instruction set, on-package high-bandwidth memory and integrated fabric. For this presentation, we will overview the main hardware characteristics of Knights Landing processors and their impact on the way we write code.

Because of locality of communication in space, a computing medium such as a Cellular Automata (CA) can scale abitrarily in size, thus providing a machine which power can grow arbitrarily. This same locality , however, makes the programming difficult, and -CA as a computing device- ... look more like a toy for mathematicians fond of funny images. They are mostly known for the game of life rule, which allthough universal, is light-years away from real practical computations. We will show that a CA rule can be real General Purpose, and even an efficient computing device. We use two main ingredient: 1- we have develop a compiler and a simulation platform allowing to program complex CA with radius of 20 and thousands of gates, using layers, and object oriented software, and an efficient execution exploiting the SIMD integer operations. 2- we obtain general purpose-ability by implementing SELF DEVELOPING SYSTEM.

The GPCA rule is not finalised yet, allthough we will be able to present several simulations, showing the feasibility. First, we show a CA rule able to do homogeneization of agents distributed on the CA medium. When the agents are hierachically organized, the homogeneisation still work on the hierarchy. Second we present the current status of our GPCA. It executes instructions send by a host, which propage through the medium like a wave. The first kind of instructions are local instructions: the effect is immediate as the instructions traverse a CA cell. Local instructions allow to select agents, and set registers. The second kind of instructions call collective instructions, define regions on the CA, in which a sustained activy takes blace by a succession of 4- 5 region-waves giving birth to each other, at specific places. They are the equivalent of the 4 - 5 cycles, needed to execute a machine instruction in a processor. Those instruction are micro-programmed, the micro-instructions are associated to each region wave. For the moment being we have implemented only 3 collective instructions, the easiest one: Election, Computation, and Duplication. We will show a development obtained by sending 8 Duplications to an initially single agent. It generates 256 offsprings, in only 600 iterations of the CA rules, which illustrates the scalability. The finalised GPCA will make use of 4 more elaborate instructions, whose functionning is described in detail in a report, but not yet implemented. The GPCA cell will have around 64 bits of memory, and 20,000 gates. For the moment, it has only 6000 gates.

We are used to think of computer as General-Purpose machine with arbitrary large memory. The GPCA is a General-Purpose machine with arbitrary large power.

In this work we investigate the design of task-based sparse direct solvers which constitute extremely irregular workloads, with tasks of different granularities and characteristics with variable memory consumption on top of runtime systems. We prove the usability and effectiveness of our approach with the implementation of a sparse matrix multifrontal factorization based on a Sequential Task Flow parallel programming model. Using this programming model, we developed features such as the integration of dense 2D Communication Avoiding algorithms in the multifrontal method allowing for better scalability compared to the original approach used in the qr_mumps solver. Following this approach, we address heterogeneous architectures where task granularity and scheduling strategies are critical to achieve performance. We present, for the multifrontal method, a hierarchical strategy for data partitioning and a scheduling algorithm capable of handling the heterogeneity of resources.

In this paper, we study the fundamental problem of counting, which consists in computing the size of a system. We consider the distributed communication model of population protocols of finite state, anonymous and asynchronous mobile devices (agents) communicating in pairs (according to a fairness condition). This work significantly improves the previous results known for counting in this model, in terms of (exact) space complexity. We present and prove correct the first space optimal protocols solving the problem for two classical types of fairness, global and weak. Both protocols require no initialization of the counted agents.

The protocol designed for global fairness, surprisingly, uses only one bit of memory (two states) per counted agent. The protocol, functioning under weak fairness, requires the necessary log P bits (P states, per counted agent) to be able to count up to P agents. Interestingly, this protocol exploits the intriguing Gros sequence of natural numbers, which is also used in the solutions to the Chinese Rings and the Hanoi Towers puzzles.

Computers are used in a wide range of applications, including embedded control systems in cars and aircraft, cloud computing, and the Internet. As we increasingly depend on the correct operation of such systems, it becomes crucial to design them in a way that ensures that they behave as expected. To do so, one has to address two problem areas: on the one hand, partial failure that is outside the control of a system designer (such as power outages or bit-flips due to radiation), and on the other hand, design faults (bugs). The former is classically addressed by means of replication and fault-tolerant distributed algorithms, while the latter is dealt with by rigorous software engineering methods, such as model checking. In order to maximize the reliability, one should deploy fault-tolerant distributed algorithms that have been verified by model checking. However, only very few distributed algorithms have been automatically verified, because many aspects of distributed algorithms, such as parameterization, faults, uncertain timing, etc. still pose research challenges for model checking.

In this talk I will discuss several verification techniques that we recently generalized and used for automatic verification of distributed algorithms. These techniques include parametric interval data and counter abstraction, offline partial order reductions, and acceleration.

Based on joint work with Annu Gmeiner, Igor Konnov, Helmut Veith, and Ulrich Schmid

Sparse triangular solvers are typically parallelized using level-scheduling techniques, but parallel efficiency is poor on high-throughput architectures like GPUs. This talk proposes an iterative approach for solving sparse triangular systems when an approximation is sufficient. Although not suitable for all problems, the approach can be successful for sparse triangular matrices arising from incomplete factorizations, where an approximate solution is acceptable. Significant performance gains can be achieved when using this approach for a preconditioned Krylov subspace method.

We investigate the approximate consensus problem in highly dynamic networks in which topology may change continually and unpredictably. We prove that in both synchronous and partially synchronous networks, approximate consensus is solvable if and only if the communication graph in each round has a rooted spanning tree. Interestingly, the class of averaging algorithms, which have the benefit of being memoryless and requiring no process identifiers, entirely captures the solvability issue of approximate consensus in that the problem is solvable if and only if it can be solved using any averaging algorithm.

After this characterization of solvability of approximate consensus, we turn to the question of time complexity. We give an algorithm that solves approximate consensus in time linear in the numer of nodes in the network and show the optimality of its time complexity up to a multiplicative constant. We also present its generalization to higher dimensional values and the case of finite memory, in which rounding procedures will be necessary.

This is joint work with Bernadette Charron-Bost and Matthias Függer.

Functions of matrices are widely used in science, engineering and the social sciences, due to the succinct and insightful way they allow problems to be formulated and solutions to be expressed. New applications involving matrix functions are regularly being found, ranging from small but difficult problems in medicine to huge, sparse systems arising in the solution of partial differential equations. In this talk I will outline the history of matrix functions, summarize some applications, and describe recent developments concerning computation of the matrix exponential and logarithm and their Fréchet derivatives.

**Biography:** Nick Higham is Richardson Professor of Applied Mathematics in the School of Mathematics, University of Manchester. His degrees (BA 1982, MSc 1983, PhD 1985) are from the University of Manchester, and he has held visiting positions at Cornell University and the Institute for Mathematics and its Applications, University of Minnesota. He is Director of Research within the School of Mathematics, Head of the Numerical Analysis Group, and was Director of the Manchester Institute for Mathematical Sciences (MIMS) 2004-2010. He was elected Fellow of the Royal Society in 2007, is a SIAM Fellow, and held a Royal Society-Wolfson Research Merit Award (2003-2008).
He is well known for his research on the accuracy and stability of numerical algorithms, and the second edition of his 700-page monograph on this topic was published by SIAM in 2002. His most recent book, Functions of Matrices: Theory and Computation (SIAM, 2008), is the first research monograph devoted to this topic. He is the Editor of the Princeton Companion to Applied Mathematics (2015, circa 1000 pages).
He has more than 100 refereed publications on topics such as rounding error analysis, linear systems, least squares problems, matrix functions and nonlinear matrix equations, condition number estimation, and polynomial eigenvalue problems. His research has been supported by grants from EPSRC and by fellowships from the Nuffield Foundation, the Royal Society and the Leverhulme Trust. He currently holds a 2M euro ERC Advanced Grant (2011-2016) supporting his research on matrix functions.
Higham is a member of the editorial boards of the journals Acta Numerica, Forum of Mathematics, Foundations of Computational Mathematics. IMA Journal of Numerical Analysis, Linear Algebra and Its Applications, and Numerical Algorithms, He is also (Founding) Editor-in-Chief of the SIAM Fundamentals of Algorithms book series.
He was Vice President at Large (2010-2013) of SIAM and has served for over ten years on the SIAM Board of Trustees and the SIAM Council. He has also served on the Board of Directors of the International Linear Algebra Society, and as Chair of the SIAM Activity Group on Linear Algebra. He is a frequent invited speaker at international conferences, served for 17 years on the (permanent) organizing committee of the Householder Symposia on Numerical Linear Algebra, and was a member of the Scientific Program Committee for ICIAM 2011.
Higham has contributed software to LAPACK and the NAG library, and has written numerous M-files included in MATLAB. His algorithms are also included in Julia, SciPy, Mathematica and other packages.
Honours include the Alston S. Householder Award VI, 1987 (for the best Ph.D. thesis in numerical algebra 1984--1987), the 1988 Leslie Fox Prize in Numerical Analysis, a 1999 Junior Whitehead Prize from the London Mathematical Society, designation as a "Highly Cited Researcher" by Thomson/ISI in 2006, and the 2008 Fröhlich Prize of the London Mathematical Society.
Higham is also author of the best-selling SIAM books Handbook of Writing for the Mathematical Sciences (2nd edition, 1998) and MATLAB Guide (with D. J. Higham, 2nd edition, 2005), and is a contributor to the popular Penguin Dictionary of Mathematics (fourth edition, 2008).

With the increased failure rate expected in future extreme scale supercomputers, process replication might become a viable alternative to checkpointing. By default, the workload efficiency of replication is limited to 50% because of the additional resources that have to be used to execute the replicas of the application's processes. In this talk, I will introduce the intra-parallelization technique, a solution that avoids replicating all computation by introducing work-sharing between replicas. I will also present SDR-MPI, our optimized implementation of active replication for MPI applications. Experiments on a representative set of benchmarks show that intra-parallelization together with SDR-MPI allow achieving more than 50% efficiency without compromising fault tolerance.

Currently, General purpose (GP)GPU programming is a popular solution to achieve high performance. It couples inexpensive highly parallel computing units with classic CPUs. These heterogenous systems lead to complex designs combining multiple paradigms and programming languages to manage each hardware architecture. This talk will present a set of tools to harness GPGPU programming through the OCaml programming language. It will describe the SPOC library, which handles GPGPU subprograms (kernels) and data transfers between devices. Then, It will show how SPOC expresses GPGPU kernels: through interoperability with common low-level extensions (from Cuda and OpenCL frameworks) but also via an embedded DSL forOCaml. Finally, I will present how to manage those kernels through parallel skeletons.

Après avoir rappelé les composantes de la puissance dissipée dans les processeurs et leur évolution avec les différents nœuds technologiques, nous montrons quels ont été les moteurs de l’évolution des performances avant l’apparition du « mur de la chaleur » puis après, avec l’arrêt de l’augmentation des fréquences d’horloge. Si l’évolution des architectures pour la haute performance est assez bien connue (multi-cœurs, many-cores, clusters), le large spectre des applications enfouies et embarquées conduit à des solutions différentes, avec notamment le retour vers des architectures de processeurs plus simples pour le bas de gamme.

La programmation parallèle est rendue obligatoire pour profiter des architectures actuelles, des smartphones aux super-calculateurs, car tous les processeurs contiennent des unités vectorielles et plusieurs coeurs. Certains outils permettent de paralléliser automatiquement du code mais les bibliothèques les plus optimisées utilisent généralement des outils de bas niveaux et des optimisations manuelles pour exploiter au mieux les capacités des processeurs et atteindre les meilleures performances.

Si les performances sont quantifiées, l'effort de développement nécessaire pour les atteindre en fonction des outils utilisés ne l'est généralement pas, de même que leur impact sur la consommation énergétique. Nous proposons de mesurer l'effort de développement à l'aide de métriques en fonction des approches de programmation parallèle utilisées et d'étudier le rapport entre effort de développement, performance et consommation énergétique.

The Iterated Immediate Snapshot model (IIS), due to its elegant geometrical representation, has become standard for applying topological reasoning to distributed computing. Its modular structure makes it easier to analyze than the more realistic (non-iterated) read-write Atomic-Snapshot memory model (AS).

In this talk we show that AS and IIS are equivalent with respect to distributed computability: a distributed function is computable in AS if and only if it is computable in IIS. Therefore, the computability of any distributed function in AS can be equivalently investigated in IIS. This is a joint work with Petr Kuznetsov and Eli Gafni to be presented at OPODIS 2014.

The growing complexity of modern computer architectures increasingly complicates the prediction of the run-time behavior of software. For real-time systems, where a safe estimation of the program’s worst-case execution time is needed, time-predictable computer architectures promise to resolve this problem. A stack cache, for instance, allows the compiler to efficiently cache a program’s stack, while static analysis of its behavior remains easy. Likewise, its implementation requires little hardware overhead.

This work introduces an optimization of the standard stack cache to avoid redundant spilling of the cache content to main memory, if the content was not modified in the meantime. At first sight, this appears to be an average-case optimization. Indeed, measurements show that the number of cache blocks spilled is reduced to about 17% and 30% in the mean, depending on the stack cache size. Furthermore, we show that lazy spilling can be analyzed with little extra effort, which benefits the worst-case spilling behavior that is relevant for a real-time system.

**Biography:** Florian Brandner is an Assistant Professor at ENSTA ParisTech working on the intersection of time-predictable computer architecture, static program analysis, worst-case execution time analysis, and compilers. He received is PhD in 2009 from the Vienna University of Technology, working on processor description languages and the automatic generation of machine-code generators. Then spend two years as a post-doc at INRIA / ENS de Lyon, working on register allocation. Before joining ENSTA, he worked within the FP7-STREP project T-CREST at the Technical University of Denmark on time-predictable computer architectures, networks-on-chip, static program analyses, and compilers for real-time systems.