Prompt Cache: Modular Attention Reuse for Low-Latency Inference
In Gim
Guojun Chen
Seung-Seob Lee
Nikhil Sarda
Anurag Khandelwal
Lin Zhong
Published in: CoRR (2023)
Keyphrases
low latency
high bandwidth
high throughput
high speed
real time
highly efficient
massive scale
virtual machine
query processing
main memory
continuous query processing
stream processing
distributed systems
network traffic