Hybrid Training Data for Historical Text OCR.
Jirí Martínek, Ladislav Lenc, Pavel Král, Anguelos Nicolaou, Vincent Christlein
Published in: ICDAR (2019)
Keyphrases
- training data
- text recognition
- optical character recognition
- printed documents
- document processing
- document images
- historical documents
- ocr systems
- text retrieval
- historical manuscripts
- document analysis
- historical data
- page layout
- supervised learning
- decision trees
- character recognition
- text extraction
- database
- text mining
- text processing
- test data
- error correction
- training process
- training set
- prior knowledge
- textual data
- training dataset
- learning algorithm
- information retrieval
- free text
- printed text
- class labels
- labeled data
- preprocessing
- data sets
- keywords
- scanned documents
- digital libraries
- document image analysis
- text regions
- handwriting recognition
- training instances
- text lines
- text analysis
- key concepts
- noisy data
- hidden markov models
- classification accuracy
- semantic information
- post processing