Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages.
Gowtham Ramesh, Sumanth Doddapaneni, Aravinth Bheemaraj, Mayank Jobanputra, Raghavan AK, Ajitesh Sharma, Sujit Sahoo, Harshita Diddee, Mahalakshmi J, Divyanshu Kakwani, Navneet Kumar, Aswin Pradeep, Kumar Deepak, Vivek Raghavan, Anoop Kunchukuttan, Pratyush Kumar, Mitesh Shantadevi Khapra
Published in: CoRR (2021)
Keyphrases
- parallel corpora
- language independent
- comparable corpora
- cross lingual
- machine translation
- cross language information retrieval
- labor intensive
- bilingual dictionaries
- statistical machine translation
- language resources
- query translation
- cross language
- cross lingual information retrieval
- machine translation system
- sentence pairs
- word pairs
- document collections
- sentence level
- semi automatic
- fully automated
- news articles
- text classification
- machine learning
- Wikipedia articles
- word recognition
- translation model
- n-gram
- question answering
- text categorization
- information extraction
- knowledge base