Automatic Risk-based Selective Redundancy for Fault-tolerant Task-parallel HPC Applications.
Omer Subasi, Osman S. Unsal, Sriram Krishnamoorthy
Published in: ESPM2@SC (2017)
Keyphrases
- fault tolerant
- fault tolerance
- distributed systems
- interconnection networks
- load balancing
- high availability
- state machine
- high performance computing
- distributed computing
- parallel processing
- risk management
- parallel implementation
- shared memory
- computer architecture
- mobile agent system
- safety critical
- message passing interface
- data replication
- distributed memory
- fine grained