PROGRAM HIGHLIGHTS

The keynote, “The three L’s in modern high-performance networking: Low latency, Low cost, Low processing load”, will be given by Torsten Hoefler, ETH Zürich.
Six papers have been accepted for the technical sessions following review by 32 highly reputed experts from both academia and industry.
Panel session: “Industrial perspective of high-speed communication technology evolution”, moderated by Dr. Young Cho, University of Southern California. The panelists will be:
– Eitan Zahavi, Mellanox Technologies, Israel
– Ola Torudbakken, Skala Norge AS, Norway
– Cyriel Minkenberg, Rockley Photonics, Ltd, Switzerland

PROGRAM AT-A-GLANCE

[08:30 – 08:45] Opening

[08:45 – 10:00] Keynote

The three L’s in modern high-performance networking: Low latency, Low cost, Low processing load
Torsten Hoefler (ETH Zürich, Switzerland)

[10:00 – 10:30] Coffee break

[10:30 – 12:00] Technical Session 1 (research papers)
Each presentation is allotted 25 minutes, plus 5 minutes for questions from the audience.

Analysis and improvement of Valiant routing in low-diameter networks
Mariano Benito, Pablo Fuentes, Enrique Vallejo and Ramon Beivide (University of Cantabria, Spain)

Node-type-based load-balancing routing for Parallel Generalized Fat-Trees
John Gliksberg, Jean-Noël Quintin and Pedro Javier García García (Atos BULL, France)

Analyzing topology parameters for achieving energy-efficient k-ary n-cubes
Francisco J. Andújar, Salvador Coll, Marina Alonso, Juan-Miguel Martínez, Pedro López, Francisco J. Alfaro, Jose L. Sánchez and Raúl Martínez (Technical University of Valencia, Spain)

[12:00 – 13:30] Lunch

[13:30 – 15:00] Technical Session 2 (research papers)
Each presentation is allotted 25 minutes, plus 5 minutes for questions from the audience.

Evaluating Energy Saving Strategies on Torus, K-Ary N-Tree, and Dragonfly
Felix Zahn, Armin Schäffer and Holger Fröning (Ruprecht-Karls University of Heidelberg, Germany)

VEF3 traces: towards a complete framework for modelling network workloads for exascale systems
Javier Cano-Cano, Francisco J. Andújar, Francisco J. Alfaro and Jose L. Sánchez (University of Castilla-La Mancha, Spain)

Improving the Efficiency of Future Exascale Systems with rCUDA
Carlos Reaño, Javier Prades and Federico Silla (Technical University of Valencia, Spain)

[15:00 – 15:30] Coffee break

[15:30 – 17:00] Panel Session

Industrial perspective of high-speed communication technology evolution
Moderated by Prof. Young Cho, University of Southern California.

Panelists:

- Eitan Zahavi, Mellanox Technologies, Israel
- Ola Torudbakken, Skala Norge AS, Norway
- Cyriel Minkenberg, Rockley Photonics Inc., Switzerland

DETAILED PROGRAM

KEYNOTE

The three L’s in modern high-performance networking: Low latency, Low cost, Low processing load

Torsten Hoefler, Associate Professor, Scalable Parallel Computing Lab
Computer Science Department, ETH Zürich

Abstract: This talk provides an overview of recent research results in high-performance networking. We discuss the history and design tradeoffs of large-scale topologies in light of the growing demand for low latency and high throughput at the lowest cost. We then introduce Slim Fly, a high-performance, cost-effective network topology that approaches the theoretically optimal network diameter. We analyze Slim Fly and compare it to both traditional and state-of-the-art networks. Our analysis shows that Slim Fly has significant advantages over other topologies in latency, bandwidth, resiliency, cost, and power consumption. After solving the topology problem, we focus on the endpoint. Today’s network cards contain rather powerful processors optimized for data movement. However, these devices are limited to fixed functions, such as remote direct memory access. We describe sPIN, a portable programming model to offload simple packet processing functions to the network card. The sPIN model for portable packet-processing network acceleration is similar to compute acceleration with CUDA or OpenCL. We demonstrate several use cases in which network acceleration enables an ecosystem that can significantly speed up applications and system services. Both of these recent results will guide the design and implementation of future data-center and HPC networks.

Torsten Hoefler is an Associate Professor of Computer Science at ETH Zürich, Switzerland. Before joining ETH, he led the performance modeling and simulation efforts of parallel petascale applications for the NSF-funded Blue Waters project at NCSA/UIUC. He is also a key member of the Message Passing Interface (MPI) Forum, where he chairs the “Collective Operations and Topologies” working group. Torsten won best paper awards at the ACM/IEEE Supercomputing Conference SC10, SC13, SC14, EuroMPI 2013, IPDPS 2015, and other conferences. He has published numerous peer-reviewed scientific conference and journal articles and authored chapters of the MPI-2.2 and MPI-3.0 standards. He received the Latsis award of ETH Zürich as well as an ERC starting grant in 2015. His research interests revolve around the central topic of “Performance-centric System Design” and include scalable networks, parallel programming techniques, and performance modeling. Additional information about Torsten can be found on his homepage at htor.inf.ethz.ch.

TECHNICAL SESSIONS

Mariano Benito, Pablo Fuentes, Enrique Vallejo and Ramon Beivide: Analysis and improvement of Valiant routing in low-diameter networks

Abstract: Valiant routing randomizes network traffic to avoid pathological congestion issues by diverting traffic to a random intermediate switch. It has received significant attention in recently proposed high-radix, low-diameter topologies, which are prone to congestion issues. It has been implemented obliviously, or as the basis of some non-minimal adaptive routing algorithms. An analysis of the original mechanism identifies two potential improvements regarding the selection of the intermediate switch. First, when traffic is local, the randomization introduced by Valiant results in unnecessarily long paths. Instead, the proposed Restricted Valiant routing randomizes traffic within a local partition, avoiding congestion and generating shorter paths. Second, in certain cases the path to the selected random intermediate node can be blocked; a version with recomputation selects a new random intermediate node as long as the associated path remains stalled.

The proposals are evaluated by simulation in a state-of-the-art Dragonfly network with different traffic patterns. Results show that Restricted Valiant is highly effective in cases of local traffic, with a small improvement under global patterns. Valiant with recomputation increases injection, further reducing average latency and increasing throughput. However, the higher injection increases congestion effects in some cases. This problem is exacerbated when more injection buffers are added, because of the increased pressure on the interconnect. Overall, the results are very relevant for routing in high-radix networks and might constitute the basis for other adaptive routing algorithms.
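
As a rough illustration of the intermediate-switch selection described in this abstract, the following Python sketch contrasts plain Valiant, a restricted variant, and a variant with recomputation. It is not the authors' implementation. The flat switch numbering, the equal-sized groups given by `group_size`, the `is_blocked` callback, and the `max_tries` bound are all assumptions introduced for the example.

```python
import random


def valiant_intermediate(src, dst, num_switches, rng=random):
    """Plain Valiant: pick a uniformly random switch other than src and dst."""
    candidates = [s for s in range(num_switches) if s not in (src, dst)]
    return rng.choice(candidates)


def restricted_valiant_intermediate(src, dst, num_switches, group_size, rng=random):
    """Restricted variant (assumed form): for local traffic, i.e. src and dst
    in the same group, draw the intermediate switch from that group only."""
    if src // group_size == dst // group_size:
        base = (src // group_size) * group_size
        candidates = [s for s in range(base, base + group_size) if s not in (src, dst)]
        if candidates:
            return rng.choice(candidates)
    # Non-local traffic (or a group too small to choose from): plain Valiant.
    return valiant_intermediate(src, dst, num_switches, rng)


def valiant_with_recomputation(src, dst, num_switches, is_blocked,
                               max_tries=8, rng=random):
    """Re-draw the intermediate switch while the path towards it is stalled.
    is_blocked(src, mid) is a hypothetical callback; max_tries bounds the loop."""
    mid = valiant_intermediate(src, dst, num_switches, rng)
    for _ in range(max_tries - 1):
        if not is_blocked(src, mid):
            break
        mid = valiant_intermediate(src, dst, num_switches, rng)
    return mid
```

With 16 switches in groups of 4, `restricted_valiant_intermediate(1, 3, 16, 4)` can only return switch 0 or 2, whereas plain Valiant may pick any of the 14 remaining switches and so can send local traffic over a much longer path.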

John Gliksberg, Jean-Noël Quintin and Pedro Javier García García: Node-type-based load-balancing routing for Parallel Generalized Fat-Trees

Abstract: High-Performance Computing (HPC) clusters are made up of a variety of node types (usually compute, I/O, service, and GPGPU nodes), and applications do not use nodes of different types in the same way. The resulting communication patterns reflect the organization of groups of nodes, and current optimal routing algorithms for all-to-all patterns will not always maximize performance for group-specific communications. Since application communication patterns are rarely available beforehand, we choose to rely on node types as a good guess for node usage. We provide a description of node type heterogeneity and analyse the performance degradation caused by an unfavorable distribution of nodes of the same type. We provide an extension to routing algorithms for Parallel Generalized Fat-Tree topologies (PGFTs) which balances load amongst groups of nodes of the same type. We show how it removes these performance issues by comparing results in a variety of situations against corresponding classical algorithms.

Francisco J. Andújar, Salvador Coll, Marina Alonso, Juan-Miguel Martínez, Pedro López, Francisco J. Alfaro, Jose L. Sánchez and Raúl Martínez: Analyzing topology parameters for achieving energy-efficient k-ary n-cubes

Abstract: Achieving an optimal performance/energy ratio is a challenge for massively parallel computer architects, and in particular for interconnection network designers. The k-ary n-cube is one of the most popular topologies used in the largest current supercomputers. In this paper, we present a study that considers two alternatives to build k-ary n-cube topologies taking advantage of the high-radix switches currently available: a topology with more dimensions and one NIC per router, or a topology with fewer dimensions, link aggregation and several NICs per router. Using link aggregation eases the implementation of simple power consumption reduction techniques. Using a simple power model, we evaluate by trace-driven simulation the impact on energy and performance of several network sizes for both topology proposals. For a fair comparison, we keep the theoretical network bandwidth fixed.

Felix Zahn, Armin Schäffer and Holger Fröning: Evaluating Energy Saving Strategies on Torus, K-Ary N-Tree, and Dragonfly

Abstract: Energy is one of the most crucial factors in the design of large-scale computing systems, especially high-performance computing. While exascale systems could be built with current hardware solutions, the required funding exceeds the budget of most institutions. Since a system is never fully utilized, energy-proportional components can save a substantial amount of energy. However, current interconnect technologies still operate at a fixed power consumption rate. Therefore, network power consumption becomes increasingly important as its contribution to overall power consumption is increasing. Energy-proportional interconnection networks are a research area that is still emerging. In this work, we analyze the effects of different topology characteristics on power consumption and potential energy savings of interconnection networks. We compare the differences in the design of common topologies and the related impact on energy savings. In particular, we analyze the power consumption of torus, k-ary n-tree, and dragonfly. We also use existing topology-independent power-saving policies to derive potential energy savings for each topology and compare the policies to other work which is specific to topology hardware features. The comparison concludes that topology-independent policies are superior for energy savings and the other work is superior for execution time.

Javier Cano-Cano, Francisco J. Andújar, Francisco J. Alfaro and Jose L. Sánchez: VEF3 traces: towards a complete framework for modelling network workloads for exascale systems

Abstract: To meet the expected performance requirements of applications running on future exascale systems, the number of processing nodes included in such systems will have to increase and, according to the current trend, also the number of cores in each node. In these systems, the networks, both off- and on-chip, interconnecting these nodes and the cores inside nodes, respectively, will have to be much more efficient than current ones. Simulation is the most common technique used to research and develop interconnection networks. Simulators have traditionally used synthetic traffic as the network workload, which does not represent the network workload that real applications generate. The use of application communication trace files is a better strategy for this purpose. In this paper, we extend an existing tool with functionality related to communication within each node. In this way, the tool will allow interconnection network simulators to model traffic due to all the communications generated in exascale systems.

Carlos Reaño, Javier Prades and Federico Silla: Improving the Efficiency of Future Exascale Systems with rCUDA

Abstract: The computing power of supercomputers and data centers has noticeably grown during the last decades at the cost of an ever increasing energy demand. The need for energy (and power) of these facilities has finally limited the evolution of high performance computing, so that many researchers are now concerned not only about performance but also about energy efficiency. However, despite the many concerns about energy consumption, the search for computing power continues. In this regard, the research on exascale systems, able to deliver 10¹⁸ floating point operations per second, has reached a wide consensus that these systems should operate within a maximum power budget of 20 megawatts. Many efficiency improvements are necessary for achieving this goal. One of these improvements is the usage of ARM low-power processors, as the Mont-Blanc project proposes. In this paper we propose the combined use of ARM processors with the remote GPU virtualization rCUDA framework as a way to improve efficiency even more. Results show that it is possible to speed up applications by more than 12x when rCUDA is used to access high-end GPUs.
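
To put the power budget in this abstract into perspective, the short Python calculation below derives the energy efficiency implied by delivering 10¹⁸ floating point operations per second within 20 megawatts. Only those two figures come from the abstract; the variable names are illustrative.

```python
# Figures quoted in the abstract.
EXASCALE_FLOPS = 10**18          # floating point operations per second
POWER_BUDGET_W = 20 * 10**6      # 20 megawatts, expressed in watts

# Efficiency the whole system must sustain to stay within the budget.
flops_per_watt = EXASCALE_FLOPS / POWER_BUDGET_W
gflops_per_watt = flops_per_watt / 1e9

print(f"Required efficiency: {gflops_per_watt:.0f} GFLOPS per watt")
```

The result is 50 GFLOPS per watt, sustained across the entire system.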

PANEL

Industrial perspective of high-speed communication technology evolution

Moderator: Dr. Young Cho, Research Assistant Professor of Computer Science at the University of Southern California, Viterbi School of Engineering, and the USC Information Sciences Institute (ISI).

Young H. Cho (PI) is a Research Assistant Professor at USC and a Computer Scientist in the Networking division of ISI. He leads the USC/ISI Underwater Testbed initiative funded under NSF ORTUN and DATURNR. To support the micro-tomography work, he initiated the development of SeaSim2D, a physical-level event-driven simulator for UWASN. He is the PI of the Rapid Problem Detection project, which developed a self-sustaining, extremely low-power wireless sensor network. He is also co-PI of the Green Edge Network project (NSF/GEN), which centers on research into conserving energy by actively monitoring and managing computer network and power activities. He also leads a research effort in computer network intrusion detection and prevention using embedded systems, including FPGAs, SoCs, and microcontrollers, as well as VLSI design. He also has extensive industrial experience in developing high-performance networking interfaces, switches, and custom networking appliances. Role: The PI will be responsible for the overall direction of the project, and will lead research in all three objectives. He will report the research results and publish papers. Dr. Cho will supervise one post-doctoral scholar and one Ph.D. student for this project. He also intends to recruit undergraduate and M.S. students for directed research.

Panelists:

Eitan Zahavi manages the Mellanox end-to-end performance architecture group, which focuses on features that improve overall system performance for both Ethernet and InfiniBand, lossy and lossless. The group also studies optical data center networks. Example fields of research are application performance, congestion control, adaptive routing, tenant isolation, and topologies. The group employs large system simulations and lab experiments to validate its hypotheses and test new feature implementations.

Ola Torudbakken is a recognised industry expert in high-performance network fabrics and system designs. Mr. Torudbakken holds an MS degree in Computer Science from the University of Oslo. He holds 36 patents, has published several papers in leading publications and conferences such as IEEE Communications, and has participated in numerous standardisation bodies including PCI-SIG and IBTA. Previously he was Chief Architect of Networking at Oracle, where he drove technical leadership and product development of Oracle networking products, including Oracle’s family of engineered systems. Torudbakken joined Oracle in 2009 through the Sun acquisition. At Sun he was a Distinguished Engineer and supervised, among other projects, the development of the largest switch ever made – the Magnum 3456-port 110Tbps switch. Prior to Sun, he was Architect and Engineering Project Manager at Dolphin ICS.

Cyriel Minkenberg is a System Architect at Rockley Photonics, Ltd., where he is presently responsible for Platform Architecture and Technical Communication. Before joining Rockley Photonics in 2015, he was a Research Staff Member at IBM Research – Zurich, Switzerland. From 2010 to 2014, he managed the System Fabrics group, which focused on the architecture and performance evaluation of a distributed 100G Ethernet datacenter switch fabric, the standardization of IEEE 802 Data Center Bridging, the design of interconnection networks for High-Performance Computing, and network virtualization. Cyriel obtained MSc and PhD degrees from the Eindhoven University of Technology, The Netherlands.