CCAligned: A Massive Collection of Cross-Lingual Web-Document Pairs.
Ahmed El-Kishky, Vishrav Chaudhary, Francisco Guzmán, Philipp Koehn
Published in: EMNLP (1) (2020)
Keyphrases
- cross lingual
- web documents
- machine translation
- language modeling
- cross lingual information retrieval
- information extraction
- cross language
- language independent
- web pages
- event extraction
- web search engines
- keywords
- text classification
- parallel corpus
- pairwise
- document collections
- mono lingual
- prefetching
- transfer learning
- news articles
- n gram
- machine translation system
- language model
- data analysis
- vector space model
- query translation
- web logs
- learning algorithm
- information retrieval