A Benchmark and Dataset for Post-OCR text correction in Sanskrit.
Ayush Maheshwari, Nikhil Singh, Amrith Krishna, Ganesh Ramakrishnan
Published in: EMNLP (Findings) (2022)
Keyphrases
- text recognition
- printed documents
- document processing
- error correction
- optical character recognition
- ocr systems
- document images
- document analysis
- text extraction
- information retrieval
- text retrieval
- post processing
- database
- text mining
- preprocessing
- page layout
- character recognition
- keywords
- handwriting recognition
- handwritten documents
- benchmark datasets
- scanned images
- machine translation
- machine learning
- free text
- feature set
- text categorization
- scanned documents
- web documents
- text documents
- text processing
- textual data
- document clustering
- text data