RT info:eu-repo/semantics/article
T1 Toward a BLAS library truly portable across different accelerator types
A1 Rodríguez Gutiez, Eduardo
A1 Moretón Fernández, Ana
A1 González Escribano, Arturo
A1 Llanos Ferraris, Diego Rafael
AB Scientific applications are among the most computationally demanding pieces of software. Their core is usually a set of linear algebra operations, which may account for a significant part of the application's overall run time. BLAS libraries address this problem by exposing a set of highly optimized, reusable routines. Several implementations are specifically tuned for particular computing platforms, including coprocessors. Examples include the implementation bundled with the Intel MKL library, which targets Intel CPUs and Xeon Phi coprocessors, and the cuBLAS library, which is specifically designed for NVIDIA GPUs. Nowadays, the computing nodes of many supercomputing clusters include one or more coprocessor types. Fully exploiting these platforms may require programs that adapt at run time to the selected device type, hardwiring into the program the code needed to use a different library for each device type that can be selected. This also forces the programmer to deal with different interface particularities and with the mechanisms that manage the memory transfers of the data structures used as parameters. This paper presents a unified, performance-oriented, and portable interface for BLAS. The interface has been integrated into a heterogeneous programming model (Controllers), which transparently supports groups of CPU cores, Xeon Phi accelerators, and NVIDIA GPUs. The contributions of this paper include: an abstraction layer that hides the programming differences between diverse BLAS libraries; new kernel classes that support the context manipulation of different external BLAS libraries; a new kernel selection policy that considers both programmer kernels and different external libraries; and a complete new Controller library interface for the whole collection of BLAS routines. This proposal enables the creation of BLAS-based portable codes that can execute on top of different types of accelerators by changing a single initialization parameter. Our software internally exploits different preexisting and widely known BLAS implementations, such as cuBLAS, MAGMA, or the one found in Intel MKL, transparently using the most appropriate library for the selected device. Our experimental results show that the abstraction does not introduce significant performance penalties, while achieving the desired portability.
PB The Journal of Supercomputing
SN 0920-8542
YR 2019
FD 2019
LK http://uvadoc.uva.es/handle/10324/39040
UL http://uvadoc.uva.es/handle/10324/39040
LA eng
NO Toward a BLAS library truly portable across different accelerator types. Eduardo Rodríguez Gutiez, Arturo González Escribano, Diego R. Llanos. The Journal of Supercomputing (Q2). First online: 10 June 2019. DOI: 10.1007/s11227-019-02925-3
DS UVaDOC
RD 22-nov-2024
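
The abstract describes a unified BLAS interface whose back-end library is chosen through a single initialization parameter. The following C fragment is a minimal, hypothetical sketch of that general idea, not the actual Controllers API (which is not shown in this record): the names blas_init, blas_dgemm, blas_finalize and the backend_t enumeration are illustrative assumptions, and only the cuBLAS and CBLAS calls are real library functions. Memory transfers, which the real system manages transparently, are left to the caller here for brevity.

/* Hypothetical sketch of a dispatch layer hiding cuBLAS vs. CPU CBLAS
 * (e.g., Intel MKL) behind one DGEMM entry point. Not the Controllers API. */
#include <cblas.h>          /* CPU CBLAS interface (e.g., Intel MKL, OpenBLAS) */
#include <cublas_v2.h>      /* NVIDIA cuBLAS */

typedef enum { BACKEND_CPU, BACKEND_CUDA } backend_t;   /* assumed names */

static backend_t g_backend;
static cublasHandle_t g_cublas;

/* A single initialization parameter selects the device/library pair. */
void blas_init(backend_t backend) {
    g_backend = backend;
    if (backend == BACKEND_CUDA)
        cublasCreate(&g_cublas);
}

/* Unified column-major DGEMM: C = alpha*A*B + beta*C.
 * For BACKEND_CUDA, A, B and C are assumed to be device pointers
 * already transferred by the caller. */
void blas_dgemm(int m, int n, int k, double alpha,
                const double *A, int lda, const double *B, int ldb,
                double beta, double *C, int ldc) {
    if (g_backend == BACKEND_CUDA) {
        cublasDgemm(g_cublas, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                    &alpha, A, lda, B, ldb, &beta, C, ldc);
    } else {
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, m, n, k,
                    alpha, A, lda, B, ldb, beta, C, ldc);
    }
}

void blas_finalize(void) {
    if (g_backend == BACKEND_CUDA)
        cublasDestroy(g_cublas);
}

With such a layer, the same application code can run on a CPU or on a GPU by changing only the argument passed to blas_init, which is the portability property the paper claims for its Controllers-based BLAS interface.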