The GINCO Training Dataset for Web Genre Identification of Documents Out in the Wild.
Taja Kuzman, Peter Rupnik, Nikola Ljubesic
Published in: LREC (2022)
Keyphrases
- training dataset
- web documents
- web data
- web pages
- training data
- web information
- web content
- website
- web mining
- multilingual documents
- web applications
- information retrieval
- open directory project
- content similarity
- document repositories
- information extraction
- training samples
- document collections
- training set
- digital documents
- textual data
- support vectors
- class labels
- retrieval systems
- database
- xml documents
- keywords
- relevant documents
- search engine
- information retrieval systems
- data samples
- machine learning
- learning algorithm
- metadata
- digital libraries
- search interface
- web search engines
- user queries
- document clustering
- text documents