PROGRAM HIGHLIGHTS
This year, the HiPINEB workshop comprises the following activities:
- Keynote by Prof. Dhabaleswar K. (DK) Panda (Ohio State University, USA).
- Research papers: presentations of the accepted research papers.
- Invited talks by Prof. John Kim (KAIST, South Korea), Prof. Lizhong Chen (Oregon State University, USA), and Prof. Torsten Hoefler (ETH Zürich, Switzerland).
PROGRAM AT A GLANCE
Room: Scarlet Oak
8:15 – 8:30am | Opening |
8:30 – 10:00am | Keynote:
RDMA-Based Networking Technologies and Middleware for Next-Generation Clusters and Data Centers |
10:00 – 10:30am | Break |
10:30am – 12:00pm | Research papers and invited talk:
Shortest paths in Dragonfly systems
Effects of Congestion Management on Energy Saving Techniques in Interconnection Networks
Invited talk: Revisiting the Dragonfly Topology in High-Performance Interconnection Networks |
12:00 – 1:00pm | Lunch |
1:00 – 3:00pm | Invited talks:
Routerless Network-on-Chip and Its Optimizations by Deep Reinforcement Learning
Hardware implementations of streaming Processing in the Network NICs |
3:00 – 3:15pm | Closing remarks |
PROGRAM DESCRIPTION
KEYNOTE
RDMA-Based Networking Technologies and Middleware for Next-Generation Clusters and Data Centers
Prof. Dhabaleswar K. (DK) Panda, The Ohio State University, USA
This talk will focus on emerging technologies and middleware for designing next-generation clusters and data centers with high performance and scalability. The role and significance of RDMA technology with InfiniBand, RoCE (v1 and v2), and Omni-Path will be presented. Challenges in designing high-performance middleware for running HPC, Big Data, and Deep Learning applications on these systems while exploiting the underlying networking features will be discussed. On the HPC front, RDMA-based designs for MPI and PGAS libraries on modern clusters with GPGPUs will be presented. An overview of RDMA-based designs for Spark, Hadoop, HBase, and Memcached will be given. On the Deep Learning side, RDMA-based designs for popular Deep Learning frameworks such as TensorFlow, Caffe, and CNTK will be highlighted. The talk will conclude with challenges in providing efficient virtualization support for next-generation clusters and data centers with CPUs and accelerators.
DK Panda is a Professor and University Distinguished Scholar of Computer Science and Engineering at the Ohio State University. He has published over 450 papers in the area of high-end computing and networking. The MVAPICH2 (High Performance MPI and PGAS over InfiniBand, Omni-Path, iWARP and RoCE) libraries, designed and developed by his research group (http://mvapich.cse.ohio-state.edu), are currently being used by more than 2,950 organizations worldwide (in 86 countries). More than 518,000 downloads of this software have taken place from the project’s site. This software is empowering several InfiniBand clusters (including the 3rd, 14th, 17th, and 27th ranked ones) in the TOP500 list. The RDMA packages for Apache Spark, Apache Hadoop and Memcached together with OSU HiBD benchmarks from his group (http://hibd.cse.ohio-state.edu) are also publicly available. These libraries are currently being used by more than 300 organizations in 35 countries. More than 28,000 downloads of these libraries have taken place. High-performance and scalable versions of the TensorFlow and Caffe frameworks are available from http://hidl.cse.ohio-state.edu. Prof. Panda is an IEEE Fellow. More details about Prof. Panda are available at http://www.cse.ohio-state.edu/~panda.
RESEARCH PAPERS
Shortest paths in Dragonfly systems
Ryland Curtsinger and David Bunde, Knox College, USA
Dragonfly is a topology for high-performance computer systems designed to exploit technology trends and meet challenging system constraints, particularly on power. In a Dragonfly system, compute nodes are attached to switches, the switches are organized into groups, and the network forms a two-level clique, with an edge between every pair of switches within a group and an edge between every pair of groups. This means that every pair of switches is separated by at most three hops: one within the source group, one from the source group to the destination group, and one within the destination group. Routing using paths of this form is typically called “minimal routing”. In this paper, we show that the resulting paths are not always the shortest possible. We then propose a new class of paths that can be used without additional networking hardware and count its members that are shorter than, or equal in length to, the corresponding minimal paths.
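To make the topology description above concrete, the short Python sketch below builds the standard minimal path between two switches under the stated structure (each group is a clique of switches, and every pair of groups shares a global link). The function name minimal_path and the global_link mapping are illustrative assumptions, not artifacts of the paper.

# Minimal sketch: construct the standard minimal path in a Dragonfly.
# A switch is identified by a (group, switch) pair; global_link maps an
# ordered pair of groups to the two switches holding the link between them.
def minimal_path(src, dst, global_link):
    src_group, src_switch = src
    dst_group, dst_switch = dst
    if src_group == dst_group:
        # Groups are cliques, so intra-group routing needs at most one hop.
        return [src] if src == dst else [src, dst]
    exit_switch, entry_switch = global_link[(src_group, dst_group)]
    path = [src]
    if src_switch != exit_switch:
        path.append((src_group, exit_switch))   # local hop in the source group
    path.append((dst_group, entry_switch))      # global hop between groups
    if dst_switch != entry_switch:
        path.append((dst_group, dst_switch))    # local hop in the destination group
    return path

# Example: groups 0 and 1 are joined by a link between switch 2 of group 0
# and switch 0 of group 1; route from switch (0, 0) to switch (1, 3).
links = {(0, 1): (2, 0)}
print(minimal_path((0, 0), (1, 3), links))      # [(0, 0), (0, 2), (1, 0), (1, 3)]

As the abstract notes, paths of this three-hop form are not always the shortest possible, which is precisely the gap the proposed new class of paths addresses.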
Effects of Congestion Management on Energy Saving Techniques in Interconnection Networks
Felix Zahn, Pedro Yebenes, Jesus Escudero-Sahuquillo, Pedro Javier Garcia and Holger Froening, Heidelberg University, Germany
In the post-Dennard-scaling era, energy becomes more and more important. While most components in data centers and supercomputers are becoming increasingly energy-proportional, this trend seems to bypass interconnection networks. Although previous studies have shown huge potential for saving energy in interconnects, the associated performance decrease seems to be obstructive. An increase in execution time can be caused by decreased bandwidth as well as by transition times during which links reconfigure and cannot transmit data. This leads to more contention on the network than interconnects usually have to deal with.
Congestion management is used in similar situations to limit the impact of this contention to single links and to prevent it from congesting the entire network. Therefore, we propose combining energy saving policies with congestion management queueing schemes in order to maintain performance while saving energy. For synthetic hotspot traffic, which we use to stress the network, this combination shows promising results for multiple topologies. In 3D torus, k-ary n-tree, and dragonfly topologies, the combination provides more than 50% lower latency and increases energy efficiency by more than 50% compared to the baseline. Although both techniques aim for fundamentally different goals, none of the investigated configurations seems to suffer any disadvantage from their combination.
INVITED TALKS
Revisiting the Dragonfly Topology in High-Performance Interconnection Networks
Prof. John Kim, Associate Professor, KAIST, South Korea
High-radix routers were proposed for high-performance computing to exploit the increasing router pin bandwidth. Building on these routers, a new topology, the Dragonfly, was proposed 10 years ago to take advantage of high-radix routers and the available signaling technology. The Dragonfly topology has also been implemented in real systems. In this talk, I will revisit the Dragonfly topology and, in particular, the benefits and challenges associated with it. In addition, I will try to answer whether the Dragonfly is the most efficient topology for high-performance computing today.
John Kim is currently an associate professor in the School of Electrical Engineering at KAIST (Korea Advanced Institute of Science and Technology) in Daejeon, Korea. John Kim received his Ph.D. from Stanford University and B.S./M.Eng. from Cornell University. His research interests include computer architecture, interconnection networks, security, and mobile systems. Prior to graduate school, he worked on the design of several microprocessors at Intel and Motorola.
Routerless Network-on-Chip and Its Optimizations by Deep Reinforcement Learning
Prof. Lizhong Chen, Oregon State University, USA
Current and future many-core processors in HPC systems demand highly efficient on-chip networks to connect hundreds or even thousands of processing cores. While router-based networks-on-chip (NoCs) offer excellent scalability, they also incur significant power and area overhead due to complex router structures. In this talk, we present a new class of on-chip networks, referred to as Routerless NoCs, where costly routers are eliminated. An example design is proposed that utilizes on-chip wiring resources smartly to achieve comparable hop count and scalability to router-based NoCs. To explore the large design space of routerless NoCs more effectively, we further develop a novel deep reinforcement learning framework that learns the optimal loop selection for routerless NoCs with various design constraints. Compared with a conventional mesh, the proposed design achieves a 9.5X reduction in power, a 7.2X reduction in area, a 2.5X reduction in zero-load packet latency, and a 1.7X increase in throughput. These results demonstrate the viability and promising benefits of the routerless paradigm and call for future work that continues to improve the performance, reliability, and security of routerless NoCs.
Lizhong Chen is currently an Assistant Professor in the School of Electrical Engineering and Computer Science at Oregon State University. Dr. Chen received his Ph.D. in Computer Engineering and M.S. in Electrical Engineering from the University of Southern California in 2014 and 2011, respectively. His research interests include computer architecture, interconnection networks, GPUs, machine learning, hardware accelerators, and emerging IoT technologies. Dr. Chen is the recipient of the National Science Foundation’s CRII Award (2016), the NSF CAREER Award (2018), and a Best Paper Nomination at IEEE NAS (2018), and has received multiple other awards and grants from government agencies and industry. He has served as a program committee member for top computer architecture conferences (e.g., ISCA, DAC, ICS), a reviewer for a number of IEEE and ACM journals (e.g., TC, TPDS, TVLSI, TCAD, TACO), and a panelist on multiple NSF panels related to computer systems architecture. Dr. Chen is also the founder and organizer of the annual International Workshop on AIDArc (AI-assisted Design for Architecture), held in conjunction with ISCA.
Hardware implementations of streaming Processing in the Network NICs
Prof. Torsten Hoefler, ETH Zürich, Switzerland
We will briefly recap the network acceleration framework streaming Processing in the Network (sPIN), which can best be described as “CUDA for the network card”. We will then describe two different hardware prototype implementations: one using an ARM-based SmartNIC from Broadcom, and a second using a custom RISC-V-based microarchitecture emulated on FPGAs. We will discuss trade-offs and performance for both implementations across several use cases. Overall, we conclude that an implementation is feasible and should take advantage of the properties of the sPIN programming model.
Torsten Hoefler is an Associate Professor of Computer Science at ETH Zürich, Switzerland. Before joining ETH, he led the performance modeling and simulation efforts of parallel petascale applications for the NSF-funded Blue Waters project at NCSA/UIUC. He is also a key member of the Message Passing Interface (MPI) Forum where he chairs the “Collective Operations and Topologies” working group. Torsten won best paper awards at the ACM/IEEE Supercomputing Conference SC10, SC13, SC14, EuroMPI’13, HPDC’15, HPDC’16, IPDPS’15, and other conferences. He published numerous peer-reviewed scientific conference and journal articles and authored chapters of the MPI-2.2 and MPI-3.0 standards. He received the Latsis prize of ETH Zurich as well as an ERC starting grant in 2015. His research interests revolve around the central topic of “Performance-centric System Design” and include scalable networks, parallel programming techniques, and performance modeling. Additional information about Torsten can be found on his homepage at htor.inf.ethz.ch.