Extracting Interlinear Glossed Text from LaTeX Documents.
Mathias Schenner, Sebastian Nordhoff
Published in: LREC (2016)
Keyphrases
- text documents
- information retrieval
- digital documents
- scientific publications
- web documents
- free text
- textual content
- keywords
- text retrieval
- text data
- document content
- plagiarism detection
- document collections
- document analysis
- text collections
- latent semantic analysis
- text analysis
- multimedia documents
- text content
- document categorization
- newspaper articles
- automatic categorization
- document processing
- electronic documents
- natural language text
- automatically extracted
- text information
- textual information
- text mining
- retrieval engine
- topic segmentation
- document retrieval
- text segments
- textual documents
- text extraction
- text clustering
- textual data
- printed documents
- linguistic analysis
- related documents
- text classification
- key concepts
- document clustering
- scientific literature
- web pages
- journal articles
- handwritten text
- digital libraries
- information extraction
- information retrieval systems
- relevant documents
- semantic information
- spoken documents
- extractive summarization
- metadata
- XML documents
- multiword
- document set
- document structure
- mathematical expressions
- text corpora
- page layout
- multi-document summarization
- semantic content
- text classifiers
- text corpus
- co-occurrence
- structured documents
- WordNet
- text categorization
- scanned documents
- topic models
- sentence level
- user queries
- news stories
- text summarization
- natural language processing
- character recognition
- document representation
- document repositories