The GINCO Training Dataset for Web Genre Identification of Documents Out in the Wild.
Taja Kuzman, Peter Rupnik, Nikola Ljubesic
Published in: CoRR (2022)
Keyphrases
- training dataset
- web documents
- web data
- training data
- web information
- multilingual documents
- web pages
- website
- document collections
- web search engines
- training set
- content similarity
- web content
- information extraction
- textual data
- data samples
- document repositories
- information retrieval
- web mining
- training samples
- web applications
- support vectors
- open directory project
- information retrieval systems
- keywords
- text documents
- xml documents
- digital documents
- web crawler
- search engine
- relevant documents
- semantic web
- learning environment
- reinforcement learning
- high quality
- metadata