This title appears in the Scientific Report 2014.
Design of scalable PGAS collectives for NUMA and manycore systems
| Personal Name(s) | Alvarez Mallon, Damian (Corresponding Author) |
|---|---|
| Contributing Institute | Jülich Supercomputing Center; JSC |
| Published in | 2014 |
| Imprint | 2014 |
| Physical Description | 239 p. |
| Dissertation Note | University of A Coruna, Diss., 2014 |
| Document Type | Dissertation / PhD Thesis |
| Research Program | without topic |
| Subject (ZB) | |

Publication portal JuSER
The increasing number of cores per processor is making multicore-based systems pervasive. This involves dealing with multiple levels of memory in NUMA systems, accessible via complex interconnects, in order to dispatch the increasing amount of data required. The key to efficient and scalable provision of data is the use of collective communication operations that minimize the impact of bottlenecks. Leveraging one-sided communications becomes more important in these systems, in order to avoid the synchronization between pairs of processes that is inherent in collective operations implemented with two-sided point-to-point functions. This thesis proposes a series of collective algorithms that provide good performance and scalability. They use hierarchical trees, overlapping one-sided communications, message pipelining and NUMA binding. An implementation has been developed for UPC, a PGAS language whose performance has also been assessed in this thesis. In order to assess the performance of these algorithms, a new microbenchmarking tool has been designed and implemented. The performance evaluation of the algorithms, conducted on 6 representative systems with 5 different processor architectures and 5 different interconnect technologies, has shown generally good performance and scalability, outperforming leading MPI algorithms in many cases, which confirms the suitability of the developed algorithms for multi- and manycore architectures.
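To illustrate the hierarchical-tree idea the abstract refers to, the following is a minimal, self-contained Python sketch of a two-level reduction: values are first combined inside each NUMA node (cheap, shared-memory level) and only one partial result per node then crosses the slower inter-node interconnect. This is an illustrative model under assumed names (`hierarchical_reduce`, `procs_per_node`), not the thesis's actual UPC implementation.

```python
# Sketch of a two-level (hierarchical-tree) reduction, as a sequential model.
# Level 1 combines values within a node; level 2 combines the node leaders'
# partial results, so only one value per node crosses the interconnect.

def hierarchical_reduce(values, procs_per_node, op=lambda a, b: a + b):
    """Reduce one value per process in two levels:
    intra-node partial reductions, then a reduction over node leaders."""
    # Level 1: each node's leader reduces the values of its local processes.
    partials = []
    for start in range(0, len(values), procs_per_node):
        node_vals = values[start:start + procs_per_node]
        acc = node_vals[0]
        for v in node_vals[1:]:
            acc = op(acc, v)
        partials.append(acc)  # held by that node's leader
    # Level 2: leaders reduce their partial results across nodes.
    result = partials[0]
    for p in partials[1:]:
        result = op(result, p)
    return result

# 8 processes on 2 nodes (4 per node): only 2 partials cross the network.
print(hierarchical_reduce(list(range(8)), procs_per_node=4))  # 28
```

In the real one-sided setting, each leader would overlap its remote puts with the intra-node work and pipeline large messages in chunks; the two-level structure above is what bounds the traffic on the inter-node links.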