Quant-LLM: Accelerating the Serving of Large Language Models via FP6-Centric Algorithm-System Co-Design on Modern GPUs.
Haojun Xia, Zhen Zheng, Xiaoxia Wu, Shiyang Chen, Zhewei Yao, Stephen Youn, Arash Bakhtiari, Michael Wyatt, Donglin Zhuang, Zhongzhu Zhou, Olatunji Ruwase, Yuxiong He, Shuaiwen Leon Song
Published in: USENIX ATC (2024)