Show simple item record

dc.contributor.author: Cámara Moreno, Jesús
dc.contributor.author: Cuenca, Javier
dc.contributor.author: García, Luis Pedro
dc.contributor.author: Giménez, Domingo
dc.date.accessioned: 2025-01-27T18:28:31Z
dc.date.available: 2025-01-27T18:28:31Z
dc.date.issued: 2014
dc.identifier.citation: Parallel Computing, 2014, Volume 40, Issue 7, Pages 309-327
dc.identifier.issn: 0167-8191
dc.identifier.uri: https://uvadoc.uva.es/handle/10324/74462
dc.description: Producción Científica
dc.description.abstract: The most computationally demanding scientific problems are solved on large parallel systems. In some cases these systems are Non-Uniform Memory Access (NUMA) multiprocessors made up of a large number of cores that share a hierarchically organized memory. The main basic component of these scientific codes is often matrix multiplication, and the efficient development of other linear algebra packages is directly based on the matrix multiplication routine implemented in the BLAS library. BLAS is used either in the form of vendor-supplied packages or as free implementations. The latest versions of this library are multithreaded and can be used efficiently in multicore systems, but when they are called inside parallel codes, the two parallelism levels can interfere and degrade performance. In this work, an auto-tuning method is proposed to automatically select the optimum number of threads to use at each parallel level when multithreaded linear algebra routines are called from OpenMP parallel codes. The method is based on a simple but effective theoretical model of the execution time of the two-level routines. The methodology is applied to a two-level matrix–matrix multiplication and to different matrix factorizations (LU, QR and Cholesky) by blocks. Traditional schemes that directly use the multithreaded BLAS routine, dgemm, are compared with schemes combining the multithreaded dgemm with OpenMP.
dc.format.mimetype: application/pdf
dc.language.iso: eng
dc.publisher: Elsevier
dc.rights.accessRights: info:eu-repo/semantics/restrictedAccess
dc.subject: Parallel Computing
dc.subject: Auto-Tuning
dc.subject.classification: Auto-tuning
dc.subject.classification: Linear Algebra
dc.subject.classification: Performance Modeling
dc.title: Auto-tuned nested parallelism: A way to reduce the execution time of scientific software in NUMA systems
dc.type: info:eu-repo/semantics/article
dc.rights.holder: Elsevier B.V.
dc.identifier.doi: 10.1016/j.parco.2014.03.011
dc.relation.publisherversion: https://www.sciencedirect.com/science/article/abs/pii/S0167819114000416
dc.identifier.publicationfirstpage: 309
dc.identifier.publicationissue: 7
dc.identifier.publicationlastpage: 327
dc.identifier.publicationtitle: Parallel Computing
dc.identifier.publicationvolume: 40
dc.peerreviewed: SI
dc.description.project: This work is part of research project TIN2012-38341-C04-03, funded by the Ministerio de Economía (MINECO)
dc.type.hasVersion: info:eu-repo/semantics/publishedVersion
dc.subject.unesco: 1203 Ciencia de Los Ordenadores
dc.subject.unesco: 3304 Tecnología de Los Ordenadores

