Resiliency at Scale: Managing Google's TPUv4 Machine Learning Supercomputer.
Yazhou Zu, Alireza Ghaffarkhah, Hoang-Vu Dang, Brian Towles, Steven Hand, Safeen Huda, Adekunle Bello, Alexander Kolbasov, Arash Rezaei, Dayou Du, Steve Lacy, Hang Wang, Aaron Wisner, Chris Lewis, Henri Bahini
Published in: NSDI (2024)
Keyphrases
- machine learning
- search engine
- website
- machine learning methods
- building blocks
- scale space
- learning algorithm
- data mining
- text classification
- neural network
- computational biology
- machine learning algorithms
- knowledge acquisition
- computational intelligence
- information extraction
- knowledge discovery
- social media
- real time
- information retrieval
- computer vision
- e learning
- reinforcement learning
- pattern recognition
- web pages
- model selection
- natural language processing
- genetic algorithm
- learning systems
- computer science
- knowledge representation
- scale invariant
- machine learning approaches
- floating point
- massively parallel
- supervised machine learning