Just-In-Time Checkpointing: Low Cost Error Recovery from Deep Learning Training Failures.
Tanmaey Gupta, Sanjeev Krishnan, Rituraj Kumar, Abhishek Vijeev, Bhargav S. Gulavani, Nipun Kwatra, Ramachandran Ramjee, Muthian Sivathanu
Published in: EuroSys (2024)
Keyphrases
- deep learning
- error recovery
- low cost
- deep architectures
- unsupervised learning
- error detection
- text understanding
- supervised learning
- machine learning
- training examples
- plan generation
- fault tolerance
- training set
- artificial intelligence
- online learning
- fault tolerant
- knowledge representation
- distributed systems
- named entities
- text mining
- packet loss
- mental models
- video transmission
- natural language understanding
- natural language processing
- active learning
- object recognition