Simultaneous Layout Style and Logical Entity Recognition in a Heterogeneous Collection of Documents.
Siyuan Chen, Song Mao, George R. Thoma
Published in: ICDAR (2007)
Keyphrases
- document collections
- text collections
- information retrieval systems
- automatic categorization
- time stamped
- logical structure
- information retrieval
- distributed information retrieval
- heterogeneous collections
- authorship attribution
- relevant documents
- document set
- document clustering
- document retrieval
- page layout
- related documents
- pdf files
- document repositories
- document classification
- metadata
- document image retrieval
- web documents
- database
- digital libraries
- text documents
- xml documents
- retrieval systems
- test collection
- user queries
- meta information
- document representation
- free text
- controlled vocabulary
- effective retrieval
- web search
- retrieved documents
- relational databases
- structured documents
- relevance judgments
- vector space