Improving Audio Generation with Visual Enhanced Caption.
Yi Yuan, Dongya Jia, Xiaobin Zhuang, Yuanzhe Chen, Zhengxi Liu, Zhuo Chen, Yuping Wang, Yuxuan Wang, Xubo Liu, Mark D. Plumbley, Wenwu Wang
Published in: CoRR (2024)
Keyphrases
- visual information
- visual features
- visual data
- cross modal
- news video
- multimedia
- low level
- signal processing
- image classification
- audio visual
- video indexing and retrieval
- visual cues
- visual perception
- semantic context
- audio signals
- text to speech
- feature extraction
- high level
- audio features
- speaker identification
- bounding box
- object recognition
- text extraction
- audio video
- content based video retrieval
- visual speech