
FTXS 2010 : 1st International Workshop on Fault-Tolerance for HPC at Extreme Scale


Link: http://institute.lanl.gov/resilience/conferences/ftxs2010/
 
When: Jun 28, 2010 - Jul 1, 2010
Where: Chicago, IL, USA
Submission Deadline: Mar 15, 2010
Notification Due: Apr 9, 2010
Final Version Due: Apr 30, 2010
 

Call For Papers


============================================================
DSN 2010: 1st International Workshop on
Fault-Tolerance for HPC at Extreme Scale
Chicago, Illinois, USA
============================================================


Objectives and Challenges
With the emergence of many-core processors, accelerators, and alternative/heterogeneous
architectures, the HPC community faces a new challenge: a scaling in number of processing
elements that supersedes the historical trend of scaling in processor frequencies. The attendant
increase in system complexity has first-order implications for fault tolerance. Mounting evidence
invalidates traditional assumptions of HPC fault tolerance: faults are increasingly multiple-point
instead of single-point and interdependent instead of independent; silent failures and silent data
corruption are no longer rare enough to discount; stabilization time consumes a larger fraction
of useful system lifetime, with failure rates projected to exceed one per hour on the largest
systems; and application interrupt rates are apparently diverging from system failure rates.
The workshop will convene a diverse group of experts in HPC and fault-tolerance to inaugurate a
fault-tolerance research agenda for responding to the unique challenges that extreme scale and
complexity present. Innovation is encouraged and discussion of non-traditional approaches is welcome.


Topics
Assuming hardware and software errors will be inescapable at extreme scale, this workshop will
consider aspects of fault tolerance peculiar to extreme scale that include, but are not limited to:
• Quantitative assessments of cost in terms of power, performance, and resource impacts
of fault-tolerant techniques, such as checkpoint restart, that are redundant in space,
time or information;
• Novel fault-tolerance techniques and implementations of emerging hardware and
software technologies that guard against silent data corruption (SDC) in memory, logic,
and storage and provide end-to-end data integrity for running applications;
• Studies of hardware / software tradeoffs in error detection, failure prediction, error
preemption, and recovery;
• Advances in monitoring, analysis, and control of highly complex systems;
• Highly scalable fault-tolerant programming models;
• Metrics and standards for measuring the need for fault tolerance and for measuring,
improving, and enforcing its effectiveness;
• Failure modeling and scalable methods of reliability, availability, performability and
failure prediction for fault-tolerant HPC systems;
• Scalable Byzantine fault tolerance and security from single-fault and fail-silent violations;
• Benchmarks and experimental environments, including fault-injection and accelerated
lifetime testing, for evaluating performance of resilience techniques under stress.


Participation and Paper Submission
Submissions are expected in the following categories:
• Extended abstracts that propose original ideas in the field;
• Work-in-progress reports that present considerable progress in the challenging areas;
• Position papers that identify open issues or discuss existing solutions.
The submissions shall be sent electronically, must conform to IEEE conference proceedings
style and should not exceed six pages including all text, appendices, and figures.


Important Dates
Submission of papers: March 15, 2010
Author notification: April 9, 2010
Camera ready papers: April 30, 2010


Further Information
http://institute.lanl.gov/resilience/conferences/ftxs2010/
Workshop location, registration and accommodation: http://www.dsn.org
