Publication: L-STAP: Learned Spatio-Temporal Adaptive Pooling for Video Captioning.