PEaCE: A Chemistry-Oriented Dataset for Optical Character Recognition on Scientific Documents.
Nan Zhang, Connor T. Heaton, Sean Timothy Okonsky, Prasenjit Mitra, Hilal Ezgi Toraman
Published in: CoRR (2024)
Keyphrases
- optical character recognition
- scientific documents
- character recognition
- text recognition
- OCR systems
- document images
- character segmentation
- handwriting recognition
- digital libraries
- printed documents
- page segmentation
- image binarization
- feature set
- natural language
- text lines
- scientific literature
- query expansion
- language model
- text extraction