Exploring the role of lexis and grammar for the stable identification of register in an unrestricted corpus of web documents.
Veronika Laippala, Jesse Egbert, Douglas Biber, Aki-Juhani Kyröläinen
Published in: Lang. Resour. Evaluation (2021)
Keyphrases
- web documents
- information extraction
- web pages
- broad coverage
- web search engines
- keywords
- vector space model
- semi-structured
- document classification
- textual information
- natural language
- web data
- link structure
- web content
- knowledge discovery
- document representation
- structured documents
- html documents
- unstructured documents
- focused crawling