Exploring the Impact of Training Data Distribution and Subword Tokenization on Gender Bias in Machine Translation.
Bar Iluz, Tomasz Limisiewicz, Gabriel Stanovsky, David Marecek
Published in: IJCNLP (1) (2023)
Keyphrases
- machine translation
- data distribution
- language independent
- cross lingual
- data streams
- natural language processing
- information extraction
- high dimensional data
- target language
- cross language information retrieval
- n gram
- named entities
- machine translation system
- concept drift
- statistical machine translation
- data points
- chinese english
- index structure
- training set
- word sense disambiguation
- nearest neighbor
- source language
- natural language
- word level
- out of vocabulary
- high dimensional
- data analysis
- machine learning
- parallel corpus
- databases