Publication: MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition.