Exploring the Impact of Training Data Distribution and Subword Tokenization on Gender Bias in Machine Translation.
Bar Iluz, Tomasz Limisiewicz, Gabriel Stanovsky, David Marecek
Published in: CoRR (2023)
Keyphrases
- data distribution
- machine translation
- language independent
- natural language processing
- target language
- data streams
- high dimensional data
- index structure
- cross lingual
- information extraction
- cross language information retrieval
- data points
- word sense disambiguation
- n gram
- parallel corpora
- statistical machine translation
- concept drift
- named entities
- parallel corpus
- machine learning
- machine translation system
- chinese english
- text retrieval
- query translation
- database
- word alignment
- training set
- out of vocabulary
- databases