GOBO: Quantizing Attention-Based NLP Models for Low Latency and Energy Efficient Inference.
Ali Hadi Zadeh
Andreas Moshovos
Published in: CoRR (2020)
Keyphrases
efficient inference
low latency
probabilistic inference
structured prediction
conditional random fields
hidden variables
probabilistic model
information extraction
high speed
energy consumption
message passing
approximate inference
linear models
fully connected