Language-Independent Text Parsing of Arbitrary HTML-Documents. Towards A Foundation For Web Genre Identification.
Georg Rehm
Published in: LDV Forum (2005)
Keyphrases
- web documents
- language independent
- html documents
- n gram
- multi lingual
- text retrieval
- web pages
- web content
- machine translation
- information extraction
- web data
- semi structured
- web search engines
- keywords
- text classification
- natural language processing
- structured documents
- cross lingual
- semantic information
- document representation
- text mining
- cross language
- syntactic categories
- database
- test collection
- wordnet
- website
- feature selection
- artificial intelligence
- information retrieval
- machine learning