Layout-Aware Text Representations Harm Clustering Documents by Type.
Catherine Finegan-Dollak, Ashish Verma
Published in: Insights (2020)
Keyphrases
- text clustering
- text documents
- document clustering
- page layout
- information retrieval
- free text
- digital documents
- text data
- web documents
- clustering algorithm
- keywords
- text mining
- text collections
- automatically discovering
- text fragments
- document collections
- document analysis
- text retrieval
- textual data
- textual content
- topic detection
- text analysis
- k means
- automatic categorization
- text representation
- text content
- information retrieval systems
- plagiarism detection
- document categorization
- textual information
- document content
- semantic information
- latent semantic analysis
- document set
- document processing
- text information
- newspaper articles
- text classification
- clustering method
- natural language text
- document corpus
- handwritten text
- semantic representations
- topic segmentation
- document representation
- key concepts
- scientific literature
- relevant documents
- xml documents
- retrieval systems
- linguistic analysis
- printed documents
- multimedia documents
- related documents
- text corpus
- text categorization
- document image retrieval
- wordnet
- text corpora
- sentence level
- information extraction
- document retrieval
- spoken documents
- retrieval engine
- handwritten documents
- document structure
- document level
- news stories
- document images
- web pages