AV-DiT: Efficient Audio-Visual Diffusion Transformer for Joint Audio and Video Generation.
Kai Wang, Shijian Deng, Jing Shi, Dimitrios Hatzinakos, Yapeng Tian
Published in: CoRR (2024)
Keyphrases
- audio visual
- visual data
- video summarization
- multi modal
- multimedia
- audio features
- meeting room
- audio visual content
- visual information
- temporal context
- multimodal fusion
- audio visual speech recognition
- video data
- sports video
- multi stream
- video sequences
- video scene
- data sets
- emotion recognition
- visual features
- video content
- multimedia data
- image data
- hidden Markov models
- human motion
- temporal information
- space time