GigaST: A 10,000-hour Pseudo Speech Translation Corpus.
Rong Ye, Chengqi Zhao, Tom Ko, Chutong Meng, Tao Wang, Mingxuan Wang, Jun Cao
Published in: CoRR (2022)
Keyphrases
- statistical machine translation
- spontaneous speech
- conversational speech
- speech recognition
- parallel corpus
- lexical features
- broadcast news
- cross language information retrieval
- parallel corpora
- speech signal
- machine translation
- machine translation system
- spoken language
- automatic speech recognition
- Chinese English
- English words
- speech synthesis
- text to speech
- audio visual
- query translation
- training corpus
- language resources
- human machine interaction
- translation model
- sentence pairs
- endpoint detection
- natural language
- speaker identification
- multimodal interfaces
- open domain
- linguistic features
- multiword
- emotion recognition
- recognition engine
- noisy environments
- spoken dialog
- parallel texts
- noun phrases
- audio stream
- emotional state