Optimizing the fault-tolerance overheads of HPC systems using prediction and multiple proactive actions.
Lei Zhu, Jianhua Gu, Yunlan Wang, Tianhai Zhao, Zhennao Cai
Published in: J. Supercomput. (2015)
Keyphrases
- fault tolerance
- distributed systems
- fault tolerant
- response time
- fault management
- single point of failure
- distributed computing
- high scalability
- load balancing
- high performance computing
- peer to peer
- replicated databases
- high availability
- group communication
- mobile agents
- database replication
- computing systems
- error detection
- databases
- artificial intelligence
- scientific computing
- metadata
- data sets
- computing environments
- wireless sensor
- distributed environment
- computer systems
- expert systems