Program

TUESDAY 8TH SEPTEMBER 2015
8:15-8:30	Presentation
8:30-10:00	Keynote
	Chair: Francisco J. Alfaro, University of Castilla-La Mancha, Spain
	Speaker: John Kim, KAIST, South Korea
	Think Globally, Act Locally: Issues in Hierarchical Large-Scale Interconnection Networks
10:00-10:30	Coffee Break
10:30-12:30	Technical Session 1
	Chair: Jesus Escudero-Sahuquillo, Technical University of Valencia, Spain
	Throughput Unfairness in Dragonfly Networks under Realistic Traffic Patterns. Pablo Fuentes, Enrique Vallejo, Cristobal Camarero, Ramon Beivide (University of Cantabria) and Mateo Valero (UPC/BSC)
	VEF Traces: A Framework for Modelling MPI Traffic in Interconnection Network Simulators. Francisco J. Andujar, Juan A. Villar, Jose L. Sanchez, Francisco J. Alfaro (University of Castilla-La Mancha) and Jesus Escudero-Sahuquillo (Technical University of Valencia)
	SlimUpdate: Minimal Routing Update for Performance-based Reconfigurations in Fat-Trees. Feroz Zahid (Simula Labs), Ernst Gunnar Gran (Simula Labs), Bartosz Bogdanski (Oracle), Bjørn Dag Johnsen (Oracle) and Tor Skeie (University of Oslo)
	Fault-Tolerant Routing for Exascale Supercomputer: The BXI Routing Architecture. Pierre Vignéras and Jean-Noël Quintin (Atos/Bull)
	Multipath Load Balancing for M × N Communication Patterns on the Blue Gene/Q Supercomputer Interconnection Network. Huy Bui (University of Illinois), Robert Jacob (ANL), Preeti Malakar (ANL), Venkatram Vishwanath (ANL), Andrew Johnson (University of Illinois), Michael Papka (ANL) and Jason Leigh (University of Hawai’i at Mānoa)
12:30-14:00	Lunch
14:00-15:30	Technical Session 2
	Chair: Jesus Escudero-Sahuquillo, Technical University of Valencia, Spain
	InfiniBand Verbs Optimizations for Remote GPU Virtualization. Carlos Reaño and Federico Silla (Technical University of Valencia)
	Efficient Queuing Schemes for HoL-Blocking Reduction in Dragonfly Topologies with Minimal-Path Routing. Pedro Yebenes Segura (University of Castilla-La Mancha), Jesus Escudero-Sahuquillo (Technical University of Valencia), Pedro Javier Garcia (University of Castilla-La Mancha) and Francisco J. Quiles (University of Castilla-La Mancha)
	Modeling a Large Data-Acquisition Network in a Simulation Framework. Tommaso Colombo (CERN), Holger Fröning (University of Heidelberg), Pedro Javier García (University of Castilla-La Mancha) and Wainer Vandelli (CERN)
15:30-16:00	Coffee Break
16:00-17:45	Panel
	How can we dramatically increase network scalability?
	Moderator: Pedro Javier Garcia, University of Castilla-La Mancha, Spain
	Panelists:
	Dhabaleswar K Panda, Ohio State University, USA
	Eitan Zahavi, Mellanox, Israel
	Satoshi Matsuoka, Tokyo Institute of Technology, Japan
17:45-18:00	Farewell

Keynote

Chair: Francisco J. Alfaro, University of Castilla-La Mancha, Spain

Speaker: John Kim, KAIST, South Korea

Think Globally, Act Locally: Issues in Hierarchical Large-Scale Interconnection Networks

Abstract: Many interconnection networks are built hierarchically: for example, the global topology can be a hierarchical topology (such as the Dragonfly topology), or the endpoints in the global network can consist of multiple nodes interconnected together. For such hierarchical interconnection networks, the design and architecture need to carefully consider the impact of the local network on the global network and vice versa. In this talk, I will discuss how the performance of the overall interconnection network can suffer if the local and the global networks are not both properly considered. In particular, the first part of the talk will address the issues in proper global adaptive routing on the Dragonfly topology for large-scale networks. In the second part of the talk, I will address the processor-interconnect, or the interconnection network in a multi-socket server, and how it can impact the overall system performance.

John Kim is an associate professor in the School of Computing at KAIST. He received his Ph.D. from Stanford University and his B.S. and M.Eng. from Cornell University. Prior to graduate school, John worked on the design of several processors at Motorola and Intel. His research interests include computer architecture, interconnection networks, and mobile systems.


Technical Session 1

Chair: Jesus Escudero-Sahuquillo, Technical University of Valencia, Spain

All the speakers will have 20 minutes for their presentations, plus 4 minutes for questions and answers

Throughput Unfairness in Dragonfly Networks under Realistic Traffic Patterns
Pablo Fuentes, Enrique Vallejo, Cristobal Camarero, Ramon Beivide (University of Cantabria) and Mateo Valero (UPC/BSC)

Abstract: Dragonfly networks have a two-level hierarchical arrangement of the network routers, and allow for a competitive cost-performance solution in large systems. Non-minimal adaptive routing is employed to fully exploit the path diversity and increase the performance under adversarial traffic patterns. Previous work has demonstrated the presence of throughput unfairness under certain adversarial traffic patterns, and proposed different alternatives to combat this effect. Throughput unfairness prevents a balanced use of the resources across the network nodes and severely degrades the performance of any application running on an affected node. In this paper we introduce a new traffic pattern denoted adversarial consecutive (ADVc), which portrays a real use case, and evaluate its impact on network performance and throughput fairness. Furthermore, we assess the limitations of global misrouting policies to alleviate this effect, and the impact of transit-over-injection priority on throughput unfairness under ADVc traffic.

VEF Traces: A Framework for Modelling MPI Traffic in Interconnection Network Simulators
Francisco J. Andujar, Juan A. Villar, Jose L. Sanchez, Francisco J. Alfaro (University of Castilla-La Mancha) and Jesus Escudero-Sahuquillo (Technical University of Valencia)

Abstract: Simulation is often used to evaluate the behaviour and measure the performance of computing systems. Specifically, in high-performance interconnection networks, simulation has been used extensively to verify the behaviour of the network itself and to evaluate its performance. In this context, network simulation must be fed with network traffic, also referred to as network workload, whose nature has been traditionally synthetic. These workloads can be used for the purpose of driving studies on network performance, but often such workloads are not accurate enough if a realistic evaluation is pursued. For this reason, other non-synthetic workloads have gained popularity over the last decades, since they better capture the realistic behaviour of existing applications. In this paper, we present the VEF traces framework, a self-related trace model, and all their associated tools. The main novelty of this framework is that, unlike existing ones, it does not provide a network simulation framework, but only offers an MPI task simulation framework. This allows the MPI-based network traffic to be used by any third-party network simulator, since the framework does not depend on any specific simulation platform.

SlimUpdate: Minimal Routing Update for Performance-based Reconfigurations in Fat-Trees
Feroz Zahid (Simula Labs), Ernst Gunnar Gran (Simula Labs), Bartosz Bogdanski (Oracle), Bjørn Dag Johnsen (Oracle) and Tor Skeie (University of Oslo)

Abstract: As the size of high-performance computing systems grows, the number of events requiring a network reconfiguration, as well as the complexity of each reconfiguration, is likely to increase. In large systems, the probability of component failure is high. At the same time, with more network components, ensuring high utilization of network resources becomes challenging. Reconfiguration in interconnection networks, like InfiniBand (IB), typically involves computation and distribution of a new set of routes in order to maintain connectivity and performance. In general, current routing algorithms do not consider the existing routes in a network when calculating new ones. Such configuration-oblivious routing might result in substantial modifications to the existing paths, and the reconfiguration becomes costly as it potentially involves a large number of source-destination pairs. In this paper, we propose a novel routing algorithm for IB-based fat-tree topologies, SlimUpdate. SlimUpdate employs techniques to preserve existing forwarding entries in switches to ensure a minimal routing update, without any performance penalty, and with minimal computational overhead. We present an implementation of SlimUpdate in OpenSM, and compare it with the current de facto fat-tree routing algorithm. Our experiments and simulations show a decrease of up to 80% in the number of total path modifications when using SlimUpdate routing, while achieving similar or even better performance than the fat-tree routing in most reconfiguration scenarios.

Fault-Tolerant Routing for Exascale Supercomputer: The BXI Routing Architecture
Pierre Vignéras and Jean-Noël Quintin (Atos/Bull)

Abstract: BXI, Bull eXascale Interconnect, is the new interconnection network developed by Atos for High Performance Computing. It has been designed to meet the requirements of exascale supercomputers. At such scale, faults have to be expected and dealt with transparently so that applications remain unaffected by them. BXI features various mechanisms for this purpose, one of which is the BXI routing component presented in this paper. The BXI routing module computes the full routing tables for a 64k-node fat-tree in a few minutes. With partial re-computation, however, it can withstand numerous inter-router link failures without any noticeable impact on running applications.

Multipath Load Balancing for M × N Communication Patterns on the Blue Gene/Q Supercomputer Interconnection Network
Huy Bui (University of Illinois), Robert Jacob (ANL), Preeti Malakar (ANL), Venkatram Vishwanath (ANL), Andrew Johnson (University of Illinois), Michael Papka (ANL) and Jason Leigh (University of Hawai’i at Mānoa)

Abstract: Achievable networking performance of applications in a supercomputer depends on the exact combination of the communication patterns of the applications and the routing algorithms used by the supercomputer. In order to achieve the highest networking performance for the applications, the routing algorithms need to be designed optimally for those communication patterns. However, while communication patterns usually vary widely from application to application and even from phase to phase in an application, routing algorithms have a limited variation and usually are optimized for typical communication patterns. This results in high networking performance for favored communication patterns but low networking performance for others. In this paper we present approaches for improving networking performance by rebalancing load on physical links on the Blue Gene/Q supercomputer. We realize our approaches in a framework called OPTIQ and demonstrate the efficacy of our framework via a set of benchmarks. Our results show that we can achieve 30% higher throughput in an experiment with data and patterns from a real application. For certain communication patterns, throughput can be up to several times higher than with the default MPI_Alltoallv used on the Blue Gene/Q supercomputer.

Technical Session 2

Chair: Jesus Escudero-Sahuquillo, Technical University of Valencia, Spain

All the speakers will have 20 minutes for their presentations, plus 4 minutes for questions and answers

InfiniBand Verbs Optimizations for Remote GPU Virtualization
Carlos Reaño and Federico Silla (Technical University of Valencia)

Abstract: The use of InfiniBand networks to interconnect high performance computing clusters has considerably increased in recent years, so much so that the majority of the supercomputers included in the TOP500 list use either Ethernet or InfiniBand interconnects. Regarding the latter, due to the complexity of the InfiniBand programming API (i.e., InfiniBand Verbs) and the lack of documentation, few recent studies are available explaining how to optimize applications to get the maximum performance from this fabric. In this paper we present two different optimizations to be used when developing applications using InfiniBand Verbs, providing average bandwidth improvements of 3.68% and 217.14%, respectively. In addition, we show that when combining both optimizations, the average bandwidth gain is 43.29%. This bandwidth increment is key for remote GPU virtualization frameworks. In fact, this noticeable gain translates into a reduction of up to 35% in execution time of applications using remote GPU virtualization frameworks.

Efficient Queuing Schemes for HoL-Blocking Reduction in Dragonfly Topologies with Minimal-Path Routing
Pedro Yebenes Segura (University of Castilla-La Mancha), Jesus Escudero-Sahuquillo (Technical University of Valencia), Pedro Javier Garcia (University of Castilla-La Mancha) and Francisco J. Quiles (University of Castilla-La Mancha)

Abstract: HPC systems are growing in number of connected endnodes, making the network a main issue in their design. In order to interconnect large systems, dragonfly topologies have become very popular in recent years as they achieve high scalability by exploiting high-radix switches. However, the high performance of dragonfly networks may drop severely due to the Head-of-Line (HoL) blocking effect. Many techniques have been proposed for dealing with this harmful effect, the most effective ones being those especially designed for a specific topology and a specific routing algorithm. In this paper we present a queuing scheme called Hierarchical Two-Levels Queuing, designed specifically to reduce HoL blocking in fully-connected dragonfly networks that use minimal-path routing. This proposal boosts network performance compared with other techniques, while requiring fewer network resources than they do. In addition, we explain an upgrade that improves the performance of existing queuing schemes.

Modeling a Large Data-Acquisition Network in a Simulation Framework
Tommaso Colombo (CERN), Holger Fröning (University of Heidelberg), Pedro Javier García (University of Castilla-La Mancha) and Wainer Vandelli (CERN)

Abstract: The ATLAS detector at CERN records particle collision “events” delivered by the Large Hadron Collider. Its data-acquisition system identifies, selects, and stores interesting events in near real-time, with an aggregate throughput of several tens of GB/s. It is a distributed software system executed on a farm of roughly 2000 commodity worker nodes communicating via TCP/IP on an Ethernet network. Event data fragments are received from the many detector readout channels and are buffered, collected together, analyzed and either stored permanently or discarded. This system, and data-acquisition systems in general, are sensitive to the latency of the data transfer from the readout buffers to the worker nodes. Challenges affecting this transfer include the many-to-one communication pattern and the inherently bursty nature of the traffic. In this paper we introduce the main performance issues brought about by this workload, focusing in particular on the so-called TCP incast pathology. Since performing systematic studies of these issues is often impeded by operational constraints related to the mission-critical nature of these systems, we focus instead on the development of a simulation model of the ATLAS data-acquisition system, used as a case study. The simulation is based on the well-established OMNeT++ framework. Its results are compared with existing measurements of the system’s behavior. The successful reproduction of the measurements by the simulations validates the modeling approach. We share some of the preliminary findings obtained from the simulation, as an example of the additional possibilities it enables, and outline the planned future investigations.

Panel

How can we dramatically increase network scalability?

Moderator:

Pedro Javier Garcia, University of Castilla-La Mancha, Spain

Panelists:

Dhabaleswar K Panda, Ohio State University, USA
Eitan Zahavi, Mellanox, Israel
Satoshi Matsuoka, Tokyo Institute of Technology, Japan

The panelists will have 20-25 minutes for their talks, followed by 20 minutes for questions and answers

Topic:

In their talks, the panelists will address the following questions:

In order to reach Exascale performance, what are the necessary changes we need to introduce in the interconnection network?
Using photonics to reduce signal attenuation seems to be mandatory, but at what levels? Just for node to node interconnects? Within the motherboard? Within the processor chip?
Most of the latency is not in the interconnect hardware. How should communication protocols be modified? How will those changes affect the programming model for massively parallel applications?
The interconnect is consuming an increasing fraction of the computer power. In addition to using photonics to reduce losses, what other techniques should be implemented to reduce power consumption? How will they affect network congestion and message latency (average latency and jitter)?
