Flagging Critical Components to Prevent Transient Faults in Real-Time Systems

This paper proposes metrics for design space exploration that highlight where in the structure of a model, and at what point in its behaviour, prevention against transient faults is needed. Previous approaches to transient faults have focused on recovery after detection; almost no research has been directed towards preventive measures. In real-time systems, however, hard deadlines are performance requirements that absolutely must be met: a missed deadline constitutes an erroneous action and a possible system failure. The proposed metrics assess a system design and flag where transient faults may have significant impact. They allow the design to be changed to minimize that impact, and they also flag where particular design techniques, such as coding of communications or memories, need to be applied in later stages of design.




