Text-to-Audio Grounding: Building Correspondence Between Captions and Sound Events.
Xuenan Xu, Heinrich Dinkel, Mengyue Wu, Kai Yu
Published in: ICASSP (2021)
Keyphrases
- audio content
- text graphics
- event detection
- text mining
- news stories
- temporal information
- cross media retrieval
- audio signal
- text to speech
- multimedia
- visual information
- visual features
- signal processing
- keywords
- semantic context
- metadata
- textual descriptions
- news video
- soccer video
- human language
- acoustic features
- textual information
- music information retrieval
- video clips
- text retrieval
- information retrieval
- text information
- text regions
- semantic representation
- database
- human activities
- multi modal