Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages.
Gowtham Ramesh, Sumanth Doddapaneni, Aravinth Bheemaraj, Mayank Jobanputra, Raghavan AK, Ajitesh Sharma, Sujit Sahoo, Harshita Diddee, Mahalakshmi J, Divyanshu Kakwani, Navneet Kumar, Aswin Pradeep, Srihari Nagaraj, Deepak Kumar, Vivek Raghavan, Anoop Kunchukuttan, Pratyush Kumar, Mitesh Shantadevi Khapra
Published in: Trans. Assoc. Comput. Linguistics (2022)
Keyphrases
- parallel corpora
- comparable corpora
- language independent
- cross lingual
- machine translation
- cross language information retrieval
- cross lingual information retrieval
- bilingual dictionaries
- statistical machine translation
- machine translation system
- language resources
- labor intensive
- query translation
- word pairs
- sentence pairs
- sentence level
- wikipedia articles
- cross language
- semi automatic
- information retrieval systems
- document collections
- target language
- test collection
- text classification
- information extraction