Transcription and translation of videos using fine-tuned XLSR Wav2Vec2 on custom dataset and mBART.
Aniket Tathe, Anand Kamble, Suyash Kumbharkar, Atharva Bhandare, Anirban C. Mitra
Published in: CoRR (2024)
Keyphrases
- fine tuned
- domain specific
- fine tuning
- human actions
- multimedia event detection
- video dataset
- crowd sourced
- video frames
- benchmark datasets
- machine translation
- video sequences
- video event detection
- photo collections
- event detection
- video analysis
- training dataset
- web videos
- spatio temporal
- trecvid multimedia event detection
- database
- video database
- dynamic scenes
- handwriting recognition
- human activities
- video data
- language model