Soft errors and their effect in HPC: the problem and some solutions.


Luigi Carro, UFRGS, Brazil -- 27-01-2015


Soft errors caused by ionizing radiation are already an issue for current technologies, and with transistors estimated to scale to 5.9 nm by 2026, computing devices will be forced to employ some reliability mechanism to ensure correct computation at a reasonable cost. Previously a major concern only in aerospace and avionics applications, soft errors have recently been reported at ground level as well, in applications ranging from high performance computing to critical embedded systems such as those in automotive applications. We believe that knowledge of the causes of soft errors, and of the pros and cons of different approaches to mitigating their effects, is valuable not only for those working on microprocessor reliability but also for those concerned with the design of software systems, since some error mitigation techniques might require a redesign of the computational stack. In this way one can avoid the huge cost in area, performance, or energy incurred by traditional techniques. In this talk we will focus on ionizing radiation as the source of soft errors and explain how experiments with real radiation are performed to evaluate the susceptibility of digital circuits to soft errors.



Luigi Carro was born in Porto Alegre, Brazil, in 1962. He received his Electrical Engineering and MSc degrees from Universidade Federal do Rio Grande do Sul (UFRGS), Brazil, in 1985 and 1989, respectively. From 1989 to 1991 he worked at ST-Microelectronics, Agrate, Italy, in the R&D group. In 1996 he received his doctoral degree in Computer Science, also from UFRGS. He is presently a full professor in the Applied Informatics Department at the Informatics Institute of UFRGS, in charge of Computer Architecture and Organization courses at the undergraduate level. He is also a member of the Graduate Program in Computer Science at UFRGS, where he is co-responsible for courses on Embedded Systems, Digital Signal Processing, and VLSI Design. His primary research interests include embedded systems design, validation, automation and test, fault tolerance for future technologies, and rapid system prototyping. He has advised more than 20 graduate students and has published more than 150 technical papers on those topics. He has authored the book Digital Systems Design and Prototyping (2001, in Portuguese) and is the co-author of Fault-Tolerance Techniques for SRAM-based FPGAs (Springer, 2006), Dynamic Reconfigurable Architectures and Transparent Optimization Techniques (Springer, 2010), and Adaptive Systems (Springer, 2012). In 2007 he received the FAPERGS Researcher of the Year prize in Computer Science. His most updated resume is located in For the latest news, please check

