Prompt Cache: Modular Attention Reuse for Low-Latency Inference.

In Gim, Guojun Chen, Seung-Seob Lee, Nikhil Sarda, Anurag Khandelwal, Lin Zhong

Published in: MLSys (2024)

Keyphrases

low latency
high bandwidth
high speed
high throughput
virtual machine
massive scale
highly efficient
real time
continuous query processing
query processing
data sets
stream processing
data access
low complexity
main memory
data collection
data management