A Digitization Pipeline for Mixed-Typed Documents Using Machine Learning and Optical Character Recognition.
Tizian Matschak, Florian Rampold, Malte Hellmeier, Christoph Prinz, Simon Trang
Published in: DESRIST (2022)
Keyphrases
- optical character recognition
- ocr systems
- machine learning
- printed documents
- word spotting
- scanned documents
- document images
- character recognition
- text recognition
- historical manuscripts
- printed text
- text lines
- handwriting recognition
- information retrieval systems
- document collections
- information retrieval
- character segmentation
- natural language processing
- machine vision
- document analysis
- feature selection
- image binarization
- document classification
- text processing
- text mining
- higher order
- document retrieval
- text documents
- relevant documents
- arabic documents
- page segmentation
- text classification
- text extraction
- text regions
- textual information
- document clustering
- web documents
- metadata