Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve.
Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, Ramachandran Ramjee
Published in: OSDI (2024)
Keyphrases
- response time
- low latency
- resource utilization
- real time
- inference engine
- inference process
- prefetching
- inference mechanism
- resource management
- data transfer
- end to end
- computational complexity
- data sets