GigaST: A 10,000-hour Pseudo Speech Translation Corpus.
Rong Ye, Chengqi Zhao, Tom Ko, Chutong Meng, Tao Wang, Mingxuan Wang, Jun Cao
Published in: INTERSPEECH (2023)
Keyphrases
- statistical machine translation
- spontaneous speech
- speech recognition
- lexical features
- conversational speech
- machine translation system
- English words
- machine translation
- parallel corpus
- training corpus
- parallel corpora
- spoken language
- text to speech
- manually annotated
- query translation
- speech signal
- Chinese-English
- speech synthesis
- automatic speech recognition
- translation model
- human machine interaction
- cross language information retrieval
- language resources
- cross lingual
- Spanish language
- out of vocabulary
- recognition engine
- broadcast news
- word alignment
- comparable corpora
- open domain
- language model
- multiword
- language acquisition
- noisy environments
- natural language text
- sentence pairs
- test set