A System for Identifying and Exploring Text Repetition in Large Historical Document Corpora.
Aleksi Vesanto, Filip Ginter, Hannu Salmi, Asko Nivala, Tapio Salakoski
Published in: NODALIDA (2017)
Keyphrases
- document corpus
- text documents
- text collections
- text corpus
- historical documents
- text corpora
- text data
- web documents
- information retrieval
- digital documents
- text content
- topic segmentation
- keywords
- document processing
- word frequency
- document analysis
- text clustering
- textual content
- document collections
- document content
- document clustering
- text mining
- scientific papers
- scientific documents
- text classifiers
- multimedia documents
- document images
- semantic information
- related documents
- printed documents
- historical manuscripts
- natural language processing
- text summarization
- document representation
- document retrieval
- text classification
- training corpus
- textual documents
- document structure
- text analysis
- textual data
- information retrieval systems
- retrieval engine
- free text
- electronic documents
- automatic text summarization
- text categorization
- news articles
- linguistic patterns
- search engine
- structured documents
- topic models
- automatic summarization
- latent semantic analysis
- statistical machine translation