P. Shivakumar, S. Keckler, C. Moore, and D. Burger. Exploiting Microarchitectural Redundancy For Defect Tolerance. The 21st International Conference on Computer Design (ICCD). October 2003. (PDF)
J. Srinivasan, S. Adve, P. Bose, and J. Rivers. Lifetime Reliability: Toward an Architectural Solution. International Symposium on Computer Architecture. Pages 70-80. May 2005. (PDF)
S. Borkar. Designing Reliable Systems from Unreliable Components: The Challenges of Transistor Variability and Degradation. IEEE Micro. Pages 10-16. November 2005. (PDF)
S. Borkar, N. Jouppi, and P. Stenstrom. Microprocessors in the Era of Terascale Integration. Design, Automation & Test in Europe Conference. April 2007. (PDF)
J. Srinivasan, S. Adve, P. Bose, and J. Rivers. The Impact of Technology Scaling on Lifetime Reliability. The International Conference on Dependable Systems and Networks (DSN-04). 2004. (PDF)
J. Srinivasan, S. Adve, P. Bose, J. Rivers, and C. Hu. RAMP: A Model for Reliability Aware MicroProcessor Design. December 2003. (PDF)
S. Lee, C. Lee, C. Choi, and D. Kwong. Time-Dependent Dielectric Breakdown in poly-Si CVD HfO2 Gate Stack. Reliability Physics Symposium Proceedings. 2002. (PDF)
J. Meindl, Q. Chen, and J. Davis. Limits on Silicon Nanoelectronics for Terascale Integration. Science. Pages 2044-2049. September 2001. (PDF)
E. Wu, J. Suñé, W. Lai, E. Nowak, J. McKenna, A. Vayshenker, and D. Harmon. Interplay of Voltage and Temperature Acceleration of Oxide Breakdown for Ultra-Thin Oxides. Microelectronic Engineering. Pages 25-31. October 2001. (PDF)
M. Erez, N. Jayasena, T. Knight, and W. Dally. Fault Tolerance Techniques for the Merrimac Streaming Supercomputer. Proceedings of the SC|05 Conference. November 2005. (PDF)
H. Al-Asaad, and A. Sarvi. Fault Tolerance for Multiprocessor Systems Via Time Redundant Task Scheduling. Proc. International Conference on VLSI. Pages 51-57. 2003. (PDF)
H. Al-Asaad, and E. Czeck. Concurrent Error Correction in Iterative Circuits by Recomputing with Partitioning and Voting. Proceedings of the IEEE VLSI Test Symposium. Pages 174-177. 1993. (PDF)
P. Agrawal. Fault Tolerance in Multiprocessor Systems without Dedicated Redundancy. IEEE Transactions on Computers. Pages 358-362. March 1988. (PDF)
D. Ness, C. Hescott, and D. Lilja. Exploring Subsets of Standard Cell Libraries to Exploit Natural Fault Masking Capabilities for Reliable Logic. ACM Great Lakes Symposium on VLSI. March 2007. (PDF)
D. Ness, C. Hescott, and D. Lilja. Modeling Failure Reduction for Combinatorial Logic using Gate Level NMR. Reliability and Maintainability Symposium (RAMS). January 2007. (PDF)
A. Maheshwari, W. Burleson, and R. Tessier. Trading Off Transient Fault Tolerance and Power Consumption in Deep Submicron (DSM) VLSI Circuits. IEEE Transactions on Very Large Scale Integration (VLSI) Systems. Pages 299-311. March 2004. (PDF)
A. Maheshwari, I. Koren, and W. Burleson. Techniques for Transient Fault Sensitivity Analysis and Reduction in VLSI Circuits. Proceedings of International Symposium on Defect and Fault Tolerance in VLSI Systems. Pages 597-604. November 2003. (PDF)
A. Maheshwari, W. Burleson, and R. Tessier. Trading off Reliability and Power Consumption in Ultra-Low Power Systems. International Symposium on Quality Electronic Design. Pages 361-366. March 2002. (PDF)
H. Cha, and J. Patel. Latch Design for Transient Pulse Tolerance. International Conference on Computer Design. Pages 385-388. 1994. (PDF)
W. Chen, R. Gong, F. Liu, K. Dai, and Z. Wang. Improving the Fault Tolerance of a Computer System with Space-Time Triple Modular Redundancy. Proceedings of the International Conference on Embedded Systems and Applications. November 2006. (PDF)
W. Chen, R. Gong, F. Liu, K. Dai, and Z. Wang. Two New Space-Time Triple Modular Redundant Techniques for Improving Fault Tolerance of Computer Systems. Proceedings of IEEE International Conference on Computer and Information Technology. Pages 175-180. September 2006. (PDF)
D. Abts, J. Thompson and G. Schwoerer. Architectural Support for Mitigating DRAM Soft Errors in Large-Scale Supercomputers. Cray Inc. 2006. (PDF)
D. Rossi, V. van Dijk, R. Kleihorst, A. Nieuwland, and C. Metra. Power Consumption of Fault Tolerant Codes: The Active Elements. On-Line Testing Symposium. Pages 61-67. July 2003. (PDF)
K. Mihic, T. Simunic, and G. De Micheli. Reliability and Power Management of Integrated Systems. Proceedings of the EUROMICRO Symposium on Digital System Design. Pages 5-11. August 2004. (PDF)
T. Simunic, K. Mihic, and G. De Micheli. Optimization of Reliability and Power Consumption in Systems on a Chip. Power And Timing Modeling, Optimization and Simulation. 2005. (PDF)
J. Samson Jr., L. De La Torre, P. Wiley, T. Stottlar, J. Ring. A Comparison of Algorithm-Based Fault Tolerance and Traditional Redundant Self-Checking for SEU Mitigation. Conference on Digital Avionics Systems. October 2001. (PDF)
F. Wang, K. Ramamritham, and J. Stankovic. Determining Redundancy Levels for Fault Tolerant Real-Time Systems. IEEE Transactions on Computers. Pages 292-301. February 1995. (PDF)
S. Ghosh, S. Basu, and N. Touba. Selecting Error Correcting Codes to Minimize Power in Memory Checker Circuits. Journal of Low Power Electronics. Pages 63-72. April 2005. (PDF)
M. Qureshi, O. Mutlu, Y. Patt. Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors. (PDF)
A. Gonzalez, S. Mahlke, S. Mukherjee, R. Sendag, D. Chiou, J. Yi. Reliability: Fallacy or Reality? IEEE Micro. Pages 36-45. December 2007. (PDF)
D. Sorin, M. Martin, M. Hill, D. Wood. SafetyNet: Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint/Recovery. International Symposium on Computer Architecture. Pages 123-134. 2002. (PDF)
M. Prvulovic, Z. Zhang, J. Torrellas. ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors. International Symposium on Computer Architecture. Pages 111-122. 2002. (PDF)
T. Bressoud, F. Schneider. Hypervisor-Based Fault Tolerance. ACM Transactions on Computer Systems. Pages 80-107. February 1996. (PDF)
F. Sheldon, K. Kavi, R. Tausworthe, J. Yu, R. Brettschneider, and W. Everett. Reliability Measurement: From Theory to Practice. IEEE Software. Pages 13-20. July 1992. (PDF)
D. Feitelson. On the Interpretation of Top500 Data. The International Journal of High Performance Computing Applications. Pages 146-153. May 1999. (PDF)
D. Feitelson. The Supercomputer Industry in Light of the Top500 Data. Computing in Science and Engineering. Pages 42-47. January 2005. (PDF)
J. Stearley. Defining and Measuring Supercomputer Reliability, Availability, and Serviceability (RAS). 2005. http://www.cs.sandia.gov/~jrstear/ras. (PDF)
A. Oliner, and J. Stearley. What Supercomputers Say: A Study of Five System Logs. In Proceedings of the International Conference on Dependable Systems and Networks (DSN). 2007. (PDF)
W. Feng. Making a Case for Efficient Supercomputing. ACM-Queue. Pages 54-64. October 2003. (PDF)
S. Sharma, C. Hsu, and W. Feng. Making a Case for a Green500 List. Parallel and Distributed Processing Symposium (IPDPS). April 2006. (PDF)
B. Schroeder, and G. Gibson. A Large-scale Study of Failures in High-performance-computing Systems. Proceedings of the International Conference on Dependable Systems and Networks (DSN2006). June 2006. (PDF)
K. Ryan, and C. Reese. Estimating Reliability Trends for the World's Fastest Computer. Proceedings of the 16th international symposium on High performance distributed computing. Pages 43-54. 2007. (PDF)
C. Constantinescu. Teraflops Supercomputer: Architecture and Validation of the Fault Tolerance Mechanisms. IEEE Transactions on Computers. Pages 886-894. September 2000. (PDF)
T. Lin, and D. Siewiorek. Error Log Analysis: Statistical Modeling and Heuristic Trend Analysis. IEEE Transactions on Reliability. Pages 419-431. October 1990. (PDF)
R. Sahoo, A. Sivasubramaniam, M. Squillante, and Y. Zhang. Failure Data Analysis of a Large-Scale Heterogeneous Server Environment. International Conference on Dependable Systems and Networks (DSN'04). Pages 772-781. June 2004. (PDF)
N. Aggarwal, P. Ranganathan, N. Jouppi, J. Smith, G. Krejcki, and K. Saluja. Fault Isolation for System-level Error Protection in Commodity Multi-Core Processors. Proceedings of the 2007 IEEE Workshop on Silicon Errors in Logic - System Effects. April 2007. (PDF)
N. Aggarwal, P. Ranganathan, N. Jouppi, and J. Smith. Configurable Isolation: Building High Availability Systems with Commodity Multi-Core Processors. Proceedings of the International Symposium on Computer Architecture (ISCA). June 2007. (PDF)
A. Oliner, and R. Sahoo. Evaluating Cooperative Checkpointing for Supercomputing Systems. International Symposium on Parallel and Distributed Processing. April 2006. (PDF)
E. Elnozahy, and J. Plank. Checkpointing for Peta-Scale Systems: A Look into the Future of Practical Rollback-Recovery. IEEE Transactions on Dependable and Secure Computing. Pages 97-108. April 2004. (PDF)
A. Oliner, R. Sahoo, J. Moreira, and M. Gupta. Performance Implications of Periodic Checkpointing on Large-Scale Cluster Systems. IEEE International Parallel & Distributed Processing Symposium. April 2005. (PDF)
S. Agarwal, R. Garg, M. Gupta, J. Moreira. Adaptive Incremental Checkpointing for Massively Parallel Systems. International Conference on Supercomputing. Pages 277-286. 2004. (PDF)
Failure Mechanisms and Models for Semiconductor Devices. JEDEC Solid State Technology Association. August 2003. http://www.jedec.org. (PDF)
Electronic Reliability Design Handbook: MIL-HDBK-338B. Department of Defense. 1998. (PDF)
P. Shivakumar, S. Keckler, D. Burger, M. Kistler, and L. Alvisi. Modeling the Effect of Technology Trends on the Soft Error Rate of Combinational Logic. Proceedings of the 2002 International Conference on Dependable Systems and Networks. Pages 389-398. 2002. (PDF)
K. Harris. Asymmetries in Soft-Error Rates in a Large Cluster System. IEEE Transactions on Device and Materials Reliability. Pages 336-342. September 2005. (PDF)
S. Michalak, K. Harris, N. Hengartner, B. Takala, and S. Wender. Predicting the Number of Fatal Soft Errors in Los Alamos National Laboratory's ASC Q Supercomputer. IEEE Transactions on Device and Materials Reliability. Pages 329-335. September 2005. (PDF)
P. Hazucha, T. Karnik, J. Maiz, S. Walstra, B. Bloechel, J. Tschanz, G. Dermer, S. Hareland, P. Armstrong, and S. Borkar. Neutron Soft Error Rate Measurements in a 90-nm CMOS Process and Scaling Trends in SRAM from 0.25-µm to 90-nm Generation. IEEE International Electron Devices Meeting (IEDM). December 2003. (PDF)
C. Hescott, D. Ness, and D. Lilja. Scaling Analytical Models for Soft Error Rate Estimation Under a Multiple-Fault Environment. Euromicro Conference on Digital System Design Architectures, Methods and Tools. August 2007. (PDF)
S. Borkar. Thousand Core Chips: A Technology Perspective. Design Automation Conference. Pages 746-749. June 2007. (PDF)
E. Ogawa, K. Jinyoung, G. Haase, H. Mogul, and J. McPherson. Leakage, Breakdown, and TDDB Characteristics of Porous Low-k Silica-based Interconnect Dielectrics. Reliability Physics Symposium Proceedings, 2003. Pages 166-172. March 2003. (PDF)
S. Borkar. Design Challenges of Technology Scaling. IEEE Micro. Pages 23-29. July 1999. (PDF)
H. Bierhenke, and A. Wieder. Microelectronics - Challenges and Changes in the Next 15 Years. Proceedings of the 1st Electronic Packaging Technology Conference. October 1997. (PDF)
S. McKee. Reflections on the Memory Wall. Conference On Computing Frontiers. Pages 162-167. 2004. (PDF)
E. Grochowski, R. Hoyt. Future Trends in Hard Disk Drives. IEEE Transactions on Magnetics. Pages 1850-1854. May 1996. (PDF)
E. Pinheiro, W. Weber, L. Barroso. Failure Trends in a Large Disk Drive Population. Proceedings of USENIX Conference on File and Storage Technologies. February 2007. (PDF)