Adapting the Tesseract Open-Source OCR Engine for Tamil and Sinhala Legacy Fonts and Creating a Parallel Corpus for Tamil-Sinhala-English.
Charangan Vasantharajan, Laksika Tharmalingam, Uthayasanker Thayasivam
Published in: IALP (2022)
Keyphrases
- parallel corpus
- Indian languages
- cross lingual
- open source
- optical character recognition
- document images
- character recognition
- machine translation
- word recognition
- language independent
- cross language information retrieval
- statistical machine translation
- query translation
- machine translation system
- word alignment
- language identification
- target language
- document analysis
- language modeling
- sentence pairs
- text lines
- cross language
- parallel corpora
- text classification
- document clustering
- translation model
- machine vision
- source language
- news articles
- information retrieval
- bilingual dictionaries