Anatomy of High-Performance GEMM with Online Fault Tolerance on GPUs.
Shixun Wu, Yujia Zhai, Jinyang Liu, Jiajun Huang, Zizhe Jian, Bryan M. Wong, Zizhong Chen
Published in: ICS (2023)
Keyphrases
- fault tolerance
- fault tolerant
- distributed systems
- distributed computing
- load balancing
- high availability
- replicated databases
- response time
- group communication
- peer to peer
- mobile agents
- high performance computing
- fault management
- database replication
- graphics processing units
- real time
- failure recovery
- general purpose
- scientific computing
- parallel processing
- data replication
- databases
- high scalability
- single point of failure
- error detection
- data sets