Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
Amey Agrawal
Nitin Kedia
Ashish Panwar
Jayashree Mohan
Nipun Kwatra
Bhargav S. Gulavani
Alexey Tumanov
Ramachandran Ramjee
Published in: CoRR (2024)
Keyphrases
response time
low latency
resource utilization
probabilistic inference
inference engine
prefetching
computational complexity
neural network
Bayesian inference
data transfer
Bayesian model
Bayesian networks
high speed
graphical models
trade-off
highly efficient
genetic algorithm