This title appears in the Scientific Report 2014.
Design of scalable PGAS collectives for NUMA and manycore systems
| Personal Name(s) | Alvarez Mallon, Damian (Corresponding Author) |
|---|---|
| Contributing Institute | Jülich Supercomputing Center; JSC |
| Published in | 2014 |
| Imprint | 2014 |
| Physical Description | 239 p. |
| Dissertation Note | University of A Coruna, Diss., 2014 |
| Document Type | Dissertation / PhD Thesis |
| Research Program | without topic |
| Subject (ZB) | |

Publication portal JuSER
The increasing number of cores per processor is making multicore-based systems pervasive. This involves dealing with multiple levels of memory in NUMA systems, accessible via complex interconnects, in order to dispatch the increasing amount of data required. The key to efficient and scalable provision of data is the use of collective communication operations that minimize the impact of bottlenecks. Leveraging one-sided communications becomes more important in these systems, in order to avoid the synchronization between pairs of processes that is inherent in collective operations implemented with two-sided point-to-point functions. This thesis proposes a series of collective algorithms that provide good performance and scalability. They use hierarchical trees, overlapping one-sided communications, message pipelining and NUMA binding. An implementation has been developed for UPC, a PGAS language whose performance has also been assessed in this thesis. In order to assess the performance of these algorithms, a new microbenchmarking tool has been designed and implemented. The performance evaluation of the algorithms, conducted on 6 representative systems with 5 different processor architectures and 5 different interconnect technologies, has shown generally good performance and scalability, outperforming leading MPI algorithms in many cases, which confirms the suitability of the developed algorithms for multi- and manycore architectures.
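To illustrate the hierarchical-tree idea the abstract refers to, the following is a minimal, self-contained Python sketch of a two-level reduction: values are first combined inside each NUMA node (cheap, shared-memory level) and only one partial result per node then crosses the slower inter-node interconnect. This is an illustrative model under assumed names (`hierarchical_reduce`, `procs_per_node`), not the thesis's actual UPC implementation.

```python
# Sketch of a two-level (hierarchical-tree) reduction, as a sequential model.
# Level 1 combines values within a node; level 2 combines the node leaders'
# partial results, so only one value per node crosses the interconnect.

def hierarchical_reduce(values, procs_per_node, op=lambda a, b: a + b):
    """Reduce one value per process in two levels:
    intra-node partial reductions, then a reduction over node leaders."""
    # Level 1: each node's leader reduces the values of its local processes.
    partials = []
    for start in range(0, len(values), procs_per_node):
        node_vals = values[start:start + procs_per_node]
        acc = node_vals[0]
        for v in node_vals[1:]:
            acc = op(acc, v)
        partials.append(acc)  # held by that node's leader
    # Level 2: leaders reduce their partial results across nodes.
    result = partials[0]
    for p in partials[1:]:
        result = op(result, p)
    return result

# 8 processes on 2 nodes (4 per node): only 2 partials cross the network.
print(hierarchical_reduce(list(range(8)), procs_per_node=4))  # 28
```

In the real one-sided setting, each leader would overlap its remote puts with the intra-node work and pipeline large messages in chunks; the two-level structure above is what bounds the traffic on the inter-node links.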