Sr. Hardware Safety Architect
Automotive System Safety Engineering,
NVIDIA
Automotive System Safety Engineering,
NVIDIA
Compiler Micro-architecture Lab,
Arizona State University
Internet of Things Group (IOTG),
Intel
Compiler Micro-architecture Lab,
Arizona State University
Mathematical & Theoretical Biology Institute,
Arizona State University
Ira A Fulton Schools of Engineering,
Arizona State University
Master of Science
Computer Engineering (2016)
Arizona State University, Tempe
Bachelor of Technology
Electronics & Communication Engineering (2013)
GITAM University, Visakhapatnam
An ideal solution for soft error tolerance should hide the effects of soft errors from the user and deliver correct results on time. Software solutions are attractive because they provide flexible reliability without requiring any hardware modifications. Our investigation of state-of-the-art error recovery techniques reveals that they suffer from poor coverage (the ability to detect and correctly recover from soft errors). This paper presents InCheck (In-application Checkpointing and Recovery), an effective, safe, and timely software technique for complete error coverage. The key features of InCheck are verified register preservation, single-memory-location checkpoints, and safe and timely recovery. To evaluate InCheck, we performed more than 210,000 fault injection experiments on different hardware components of an ARM Cortex-A53-like processor running MiBench applications. The original programs and the programs protected by SWIFT-R (the state of the art) produced wrong outputs in 8,000 and 1,800 instances, respectively, whereas programs protected by InCheck experienced no failures.
Soft errors are considered the main reliability challenge for sub-nanoscale microprocessors. Software-level soft error resilience schemes are desirable because they require no hardware modifications and their protection can be tuned to the application's requirements. However, existing software-level error-tolerance schemes do not provide a high level of protection. In this work, we present NEMESIS, a compiler-level, fine-grained soft error detection, diagnosis, and recovery technique that provides a high degree of error resilience. NEMESIS runs three versions of each computation and detects soft errors by checking the results of all memory write and branch operations. On a mismatch, the NEMESIS recovery routine reverts the effect of the error on the architectural state of the program, and the program resumes normal execution. Our extensive μ-architectural-level fault injection experiments show that the NEMESIS transformation detects all soft errors and recovers from 97% of the detected errors.
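The triplicate-and-check idea in the NEMESIS abstract can be illustrated with a small sketch. This is not the NEMESIS implementation, which operates as a compiler transformation on machine instructions; it only assumes that three redundant copies of a value are compared at the point of a memory write and that a majority vote is used for diagnosis and repair. The names `checked_store` and `redundant_add`, and the dictionary standing in for memory, are hypothetical.

```python
from collections import Counter


def checked_store(memory, address, copies):
    """Commit a value to memory only after comparing three redundant copies.

    Returns the (possibly repaired) copies so that later computation
    continues from a consistent state.
    """
    if copies[0] == copies[1] == copies[2]:
        memory[address] = copies[0]
        return list(copies)
    # Mismatch detected: diagnose by majority vote and repair the odd copy.
    value, votes = Counter(copies).most_common(1)[0]
    if votes < 2:
        # All three copies disagree, so this sketch cannot recover.
        raise RuntimeError("unrecoverable: all three copies disagree")
    memory[address] = value
    return [value, value, value]


def redundant_add(a_copies, b_copies):
    """Perform the same addition independently on each redundant copy."""
    return [a + b for a, b in zip(a_copies, b_copies)]


memory = {}
total = redundant_add([5, 5, 5], [7, 7, 7])
total[1] ^= 1 << 3  # simulate a single bit flip in one redundant copy
total = checked_store(memory, 0x10, total)
```

In this sketch the corrupted copy is outvoted before the store commits, so the wrong value never reaches memory. The case where all three copies disagree is left unrecoverable.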
Soft errors are considered a key reliability challenge for sub-nanoscale transistors. An ideal solution should ultimately eliminate the effects of soft errors from the microprocessor. Forward recovery techniques recover quickly by simply voting out the wrong values, but they incur the overhead of executing three copies. Backward recovery techniques need only two copies but suffer from checkpointing overhead. In this work, I explore the efficiency of integrating checkpointing into the application and the effectiveness of the recovery that can be built on it. After evaluating the available fine-grained recovery approaches, I introduce InCheck, an in-application recovery scheme that can be integrated into instruction-duplication-based techniques to provide fast error recovery. The proposed technique takes lightweight checkpoints at basic-block granularity and uses them for recovery. To evaluate its effectiveness, 10,000 fault injection experiments were performed on different hardware components of a simulated modern in-order ARM processor. InCheck recovered from all detected errors by replaying about 20 instructions, whereas the state-of-the-art recovery scheme failed more than 200 times.
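Backward recovery at basic-block granularity can be sketched as follows. This is a rough illustration, not InCheck itself: it assumes a basic block can be modeled as a function over a small state dictionary, uses a full deep copy as a stand-in for the lightweight checkpoint, and does not model register preservation or single-location checkpoints. The names `run_block_with_recovery`, `sample_block`, and `inject_fault_on_attempt` are hypothetical.

```python
import copy


def sample_block(state):
    """A stand-in for one basic block: a short straight-line computation."""
    state["acc"] = state["acc"] + state["x"] * 2
    return state


def run_block_with_recovery(block, state, inject_fault_on_attempt=None,
                            max_attempts=3):
    """Run a block twice, compare the results, and replay on a mismatch.

    Returns the verified state and the number of replays that were needed.
    """
    checkpoint = copy.deepcopy(state)  # checkpoint taken at block entry
    for attempt in range(max_attempts):
        primary = block(copy.deepcopy(checkpoint))
        shadow = block(copy.deepcopy(checkpoint))
        if attempt == inject_fault_on_attempt:
            primary["acc"] ^= 1 << 2  # simulated transient bit flip
        if primary == shadow:
            return primary, attempt  # duplicated executions agree
        # Mismatch: discard both results and replay from the checkpoint.
    raise RuntimeError("block did not produce matching results")


result, replays = run_block_with_recovery(
    sample_block, {"acc": 1, "x": 4}, inject_fault_on_attempt=0
)
```

Because the rollback target is the entry of the current block, a detected error costs only the re-execution of that block, which is in the spirit of the short replay distance reported above.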
Feel free to contact me on: