A 500 Million Word POS-Tagged Icelandic Corpus.
Thomas Eckart, Erla Hallsteinsdóttir, Sigrún Helgadóttir, Uwe Quasthoff, Dirk Goldhahn
Published in: LREC (2014)
Keyphrases
- unknown words
- training corpus
- part of speech
- ambiguous words
- linguistic information
- n-gram
- word sense disambiguation
- news corpus
- linguistic features
- multiword
- word frequencies
- text corpus
- noun phrases
- word sense
- sentence level
- POS tagging
- english words
- text classification
- lexical features
- word pairs
- statistical machine translation
- syntactic information
- linguistic knowledge
- morphological analysis
- news articles
- POS taggers
- co-occurrence
- translation model
- word segmentation
- parallel corpus
- natural language processing
- natural language text
- word co-occurrence
- WordNet
- syntactic categories
- parallel corpora
- word frequency
- semantic relations
- machine translation
- text documents
- language model
- sentence pairs