The workshop will take place at the Hilton Austin, 500 E 4th St, Austin, TX 78701. The HiPINEB workshop will be held in room 400/402. Speakers in the Technical Sessions have 20 minutes for their presentations, plus 2 minutes for questions from the audience.

7:30 – 8:30am – Breakfast (616AB)

8:30 – 8:40am – Opening
Pedro Javier Garcia, University of Castilla-La Mancha, Spain
Jesus Escudero-Sahuquillo, University of Castilla-La Mancha, Spain

8:40 – 10:00am – Keynote
Chairman: Francisco J. Quiles, University of Castilla-La Mancha, Spain

Issues in the Design of an Exascale Network
Bill Dally, Chief Scientist and SVP of Research in NVIDIA, and Stanford Professor

10:00 – 10:30am – Break (room 616AB)

10:30am – 12:00pm – Panel

Massive-storage Networks vs Intensive-computing Networks
Moderator: John Kim, HP Labs / KAIST, South Korea


  • Dave Mayhew, San Diego University, USA
  • Bill Dally, NVIDIA and Stanford University, USA
  • Torsten Hoefler, ETH Zurich, Switzerland

12:00 – 1:30pm – Lunch 

1:30 – 3:00pm – Technical Sessions
Chairman: Michihiro Koibuchi, National Institute of Informatics, Japan

3:00 – 3:30pm – Break (room 616AB)

3:30 – 4:55pm – Technical Sessions
Chairman: Jesus Escudero-Sahuquillo, University of Castilla-La Mancha, Spain

4:55 – 5:00pm – Closing


Detailed Program


Issues in the Design of an Exascale Network
Bill Dally, Chief Scientist and SVP of Research in NVIDIA, and Stanford Professor


Bill Dally joined NVIDIA in January 2009 as chief scientist, after spending 12 years at Stanford University, where he was chairman of the computer science department. Dally and his Stanford team developed the system architecture, network architecture, signaling, routing and synchronization technology that is found in most large parallel computers today. Dally was previously at the Massachusetts Institute of Technology from 1986 to 1997, where he and his team built the J-Machine and the M-Machine, experimental parallel computer systems that pioneered the separation of mechanism from programming models and demonstrated very low overhead synchronization and communication mechanisms. From 1983 to 1986, he was at California Institute of Technology (CalTech), where he designed the MOSSIM Simulation Engine and the Torus Routing chip, which pioneered “wormhole” routing and virtual-channel flow control.


Massive-storage Networks vs Intensive-computing Networks
Moderator: John Kim, HP Labs / KAIST, South Korea


  • Dave Mayhew, San Diego University
  • Bill Dally, NVIDIA and Stanford University
  • Torsten Hoefler, ETH Zurich, Switzerland

Short Bios


David Mayhew is an experienced cybersecurity professional with over 35 years of applied experience and a PhD in Computer Engineering. Dr. Mayhew has extensive experience teaching undergraduate computer science courses at all levels as well as graduate courses. While working at AMD, Dr. Mayhew pioneered an entirely new switch technology termed Server Aggregation Switch (SAW), which is a mechanism for building monolithic switches on a scale and speed that is otherwise impossible. Since then, Dr. Mayhew has concentrated on a hardware acceleration strategy that focuses on the software aspects of reconfigurable logic usage. Dr. Mayhew authored “Efficient C++: Performance Programming Techniques,” and has over 35 patents granted or in process.

Torsten Hoefler is an Assistant Professor of Computer Science at ETH Zurich, Switzerland. Before joining ETH, he led the performance modeling and simulation efforts of parallel petascale applications for the NSF-funded Blue Waters project at NCSA/UIUC. He is also a key member of the Message Passing Interface (MPI) Forum, where he chairs the “Collective Operations and Topologies” working group. Torsten won best paper awards at the ACM/IEEE Supercomputing Conference SC10, SC13, SC14, EuroMPI 2013, IPDPS 2015, and other conferences. He has published numerous peer-reviewed scientific conference and journal articles and authored chapters of the MPI-2.2 and MPI-3.0 standards. He received the Latsis award of ETH Zurich as well as an ERC starting grant in 2015. His research interests revolve around the central topic of “Performance-centric System Design” and include scalable networks, parallel programming techniques, and performance modeling. Additional information about Torsten can be found on his homepage at

Technical Paper Abstracts

Dragonfly+: Low Cost Topology for Scaling Data Centers
Alexander Shpiner, Zachy Haramaty, Saar Eliad, Vladimir Zdornov, Barak Gafni and Eitan Zahavi (Mellanox Technologies, Israel)

The Dragonfly topology was introduced by Kim et al., aiming to decrease the cost and diameter of the network. The topology divides routers into groups connected by long links. Each group strives to implement a high-radix virtual router, and the groups are connected in a completely-connected topology. In this paper, we propose an extended Dragonfly+ network in which routers inside the group are connected in a Clos-like topology. Dragonfly+ is superior to conventional Dragonfly due to the significantly larger number of hosts that it is able to support. In addition, Dragonfly+ supports similar or better bisectional bandwidth for various traffic patterns, and requires a smaller number of buffers to avoid credit loop deadlocks in lossless networks. Moreover, we introduce a novel Fully Progressive Adaptive Routing algorithm with remote congestion notifications. To support our proposal, we present an analytical study and simulations.

A case study on implementing virtual 5D torus networks using network components of lower dimensionality
Francisco Andujar-Muñoz, Juan A. Villar, Jose L. Sanchez, Francisco Alfaro and Holger Fröning (University of Castilla-La Mancha, Spain, and Ruprecht-Karls University of Heidelberg, Germany)

Several of the most powerful supercomputers in the Top500 and the Graph500 lists continue to choose a torus topology to interconnect a large number of compute nodes. In some cases, a torus network with five or six dimensions is implemented; however, the cost of implementing an interconnection network increases with the node degree. In previous works, we defined and characterized the nD Twin (nDT) torus topology in order to virtually increase the dimensionality of a torus. This new topology reduces the distances between nodes and therefore increases network performance. In this work, we present how to build a 5DT torus network using commercial 6-port network cards. The main issues of this approach are detailed, and we present solutions to these problems. Moreover, we show that, using the same components, the performance of the 5DT torus network is higher than the performance of the 3D torus network for the same number of compute nodes.

New link arrangements for Dragonfly networks
Madison Belka, Myra Doubet, Sofia Meyers, Rosemary Momoh, David Rincon-Cruz and David Bunde (Knox College, and Columbia University, USA)

Dragonfly networks have been proposed to exploit high-radix routers and optical links for high performance computing (HPC) systems. Such networks divide the switches into groups, with a local link between each pair of switches in a group and a global link between each pair of groups. Which specific switch serves as the endpoint of each global link is determined by the network’s global link arrangement. We propose two new global link arrangements, each designed using intuition of how to optimize bisection bandwidth when global links have high bandwidth relative to local links. Despite this, the new arrangements generally outperform previously known arrangements for all bandwidth relationships.

An Effective Queuing Scheme to Provide Slim Fly topologies with HoL Blocking Reduction and Deadlock Freedom for Minimal-Path Routing
Pedro Yebenes Segura, Jesus Escudero-Sahuquillo, Pedro Javier Garcia, Francisco J. Quiles and Torsten Hoefler (University of Castilla-La Mancha, Spain, and ETH Zurich, Switzerland)

Interconnection network performance becomes a key issue in HPC systems as their size grows. In order to maximize network performance with the minimum quantity of network resources, the Slim Fly topology was proposed. It offers high network bandwidth and guarantees a network diameter of two. However, in congestion situations where the head-of-line (HoL) blocking effect arises, Slim Fly performance may drop dramatically. To alleviate this problem, we first present in this paper an analysis of congestion dynamics in Slim Fly networks. Then, based on this analysis, we propose the technique Slim Fly 2-Level Queuing (SF2LQ), specially designed for Slim Fly topologies using minimal-path routing. SF2LQ configures several virtual channels (VCs) grouped into two virtual networks to reduce HoL blocking while providing deadlock-free routing. This technique leverages the resources in network devices by efficiently using the available VCs. Finally, through simulation experiments, we show how our proposal boosts network performance while requiring a smaller number of VCs at input port buffers compared with other techniques.

Early Experiences with Saving Energy in Direct Interconnection Networks
Felix Zahn, Steffen Lammel and Holger Fröning (Ruprecht-Karls University of Heidelberg, Germany)

Energy is emerging as one of the most crucial factors in design decisions for future large-scale computing systems. Exascale installations in particular will have to operate within hard power and energy constraints. Besides economic reasons, power consumption is also limited by power distribution, cooling capabilities, and the need to minimize carbon footprints. While other components, such as processors, become more and more energy-proportional, interconnects are still highly energy-disproportional. Although interconnection networks contribute only about 10-20% of the overall power consumption of High-Performance Computing (HPC) or Cloud systems, this fraction is likely to increase significantly in the near future. Therefore, power saving strategies are mandatory for improving energy efficiency and thereby performance within hard power constraints. In this work, we introduce a simple energy saving strategy, which switches links on and off depending on the user’s performance constraints. To this end, we adapted an existing OMNeT++ network simulator by adding new energy features. This simulator allows us to run traces of real-world applications, including LULESH, NAMD, and Graph500, with different configurations. We show that this policy enables possible energy savings of up to 39% in interconnection networks. Furthermore, we demonstrate the impact of hardware design parameters, such as transition time, on possible power saving strategies.

Extending commodity OpenFlow switches for large-scale HPC deployments
Mariano Benito, Enrique Vallejo, Ramón Beivide and Cruz Izu (University of Cantabria, Spain, and The University of Adelaide, Australia)

Commodity Ethernet networks are used in many HPC systems. Extensions based on OpenFlow have been proposed for large HPC deployments, considering scalability and power consumption concerns. Such designs employ low-diameter topologies to minimize power consumption, such as Flattened Butterflies or Dragonflies. However, these topologies require non-minimal adaptive routing to deal with varying traffic characteristics and avoid pathological behaviors. The solutions to this issue in previous work rely on Ethernet Pauses to choose between minimal and non-minimal routing, depending on the availability (Pause status) of each corresponding output port. Nevertheless, such a design yields an undesirably high average latency under adversarial traffic patterns and a reduction in peak throughput under uniform traffic. This paper identifies the causes of the issues presented above, and presents a preliminary study of alternative solutions based on exploiting commodity congestion notification messages (QCN, 802.1Qau), currently available in datacenter switches. This work presents the main differences between a congestion control mechanism such as QCN, which performs injection throttling, reducing average network load, and an adaptive routing mechanism, which diverts traffic away from the congested area but increases average network load. In particular, it identifies the difficulty of distinguishing the cases of uniform traffic at saturation and adversarial traffic at low loads.

Isolating jobs for security on high-performance fabrics
Matthieu Pérotin and Tom Cornebize (Atos, France, and ENS Lyon, France)

The various pieces of equipment in supercomputers are shared between jobs that belong to different users. This situation raises security concerns. Jobs must not be able to conduct denial of service attacks targeting other jobs (voluntarily or accidentally). Moreover, job isolation must be guaranteed: unauthorized communication between two different jobs should not be allowed. However, high-performance interconnects are designed with performance as their main objective, and bypass the OS and its security models. In this paper, we show that by acting at the routing table level, it is possible to enforce job isolation without impacting job performance. Moreover, the isolation process can be dynamic and quick to set up, with algorithms that are independent of both the routing algorithms and the interconnect topology.

Knapp: A Packet Processing Framework for Manycore Accelerators
Junhyun Shim, Joongi Kim, Keunhong Lee and Sue Moon (SAP Labs Korea, Lablup Inc. and KAIST, South Korea)

High-performance network packet processing benefits greatly from parallel-programming accelerators such as Graphics Processing Units (GPUs). Intel Xeon Phi, a relative newcomer in this market, is a distinctive platform because its x86-compatible vectorized architecture offers additional optimization opportunities. Its software stack exposes low-level communication primitives, enabling fine-grained control and optimization of offloading processes. Nonetheless, our microbenchmarks show that the offloading APIs for Xeon Phi fall short of combining low latency and high throughput for both I/O and computation. In this work, we exploit Xeon Phi’s low-level threading mechanisms to design a new offloading framework, Knapp, and evaluate it using simplified IP routing applications. Knapp lays the ground for full exploitation of Xeon Phi as a packet processing framework.