A Novel OCR Approach Based on Document Layout Analysis and Text Block Classification.
Weiheng Zhu, Yuanfeng Liu, Liang Hao
Published in: CIS (2016)
Keyphrases
- document processing
- printed documents
- document analysis
- document classification
- text documents
- text lines
- document images
- web documents
- document categorization
- text classification
- information retrieval
- text classifiers
- digital documents
- supervised machine learning
- text recognition
- keywords
- text clustering
- scanned documents
- preprocessing
- machine learning
- automatic categorization
- digital libraries
- decision trees
- textual content
- optical character recognition
- image classification
- classification accuracy
- text categorization
- page segmentation
- text extraction
- training set
- page layout
- semantic information
- information retrieval systems
- text mining
- support vector
- textual features
- feature selection
- document content
- scientific documents
- automatic text classification
- scanned images
- related documents
- text content
- text corpus
- feature extraction
- character recognition
- relevant documents
- bag of words
- feature space
- support vector machine