Research registry

HF - Methods for the detection and identification of hardware faults in HPC systems for AI applications

Duration:
01/11/2023 - 31/01/2026
Principal investigator:
Project type:
PNRR – Mission 4
Funding body:
PRIVATE (ICSC - Centro Nazionale di Ricerca in HPC, Big Data and Quantum Computing / MUR)
PoliTo role:
Partner

Abstract

Large, complex high-performance computing (HPC) systems are now strategic infrastructure supporting a wide range of processes and applications, including scientific computing and the financial domain. The reliability they require calls for efficient, intelligent mechanisms for identifying hardware faults, particularly during field operation. The probability of such faults grows not only with the complexity of HPC systems but also with the technological evolution of electronic devices. This technology-related vulnerability highlights the need for a new generation of hardware fault detection mechanisms that avoid or handle data and workload corruption [1,2]. The complexity and heterogeneity of HPC systems make detecting and identifying hardware faults increasingly difficult. Such faults may produce failures that interfere with Artificial Intelligence applications, which frequently take part in critical flows and processes, so tools and methods for timely fault detection and identification are essential. The main objective of the project is therefore to develop efficient techniques for identifying hardware faults in HPC systems during the operational phase, with particular attention to the architectures used to execute Artificial Intelligence applications (processors, memories, hardware accelerators). The planned activities also aim to assess the cost of the proposed techniques and the ease of integrating them into existing flows, including through appropriate test cases that can be provided by Intesa Sanpaolo (ISP).

Structures involved

Partners

  • INTESA SANPAOLO S.P.A. - Coordinator
  • POLITECNICO DI TORINO - Central Administration

Keywords

ERC sectors

PE6_4 - Theoretical computer science, formal methods, and quantum computing

Sustainable Development Goals

Goal 9. Build resilient infrastructure, promote inclusive and sustainable industrialization and foster innovation

Budget

Total project cost: € 1.00
Total project contribution: € 1.00
Total PoliTo cost: € 1.00
PoliTo contribution: € 1.00