Building a 70 billion word corpus of English from ClueWeb.
Jan Pomikálek, Milos Jakubícek, Pavel Rychlý
Published in: LREC (2012)
Keyphrases
- english words
- statistical machine translation
- unknown words
- parallel corpus
- multiword
- training corpus
- word sense
- sentence pairs
- stop words
- link grammar
- co-occurrence
- word frequencies
- person names
- machine translation system
- cross-lingual
- text corpus
- open domain
- machine translation
- lexical information
- english text
- word alignment
- natural language text
- chinese english
- word sense disambiguation
- parallel corpora
- sentence level
- part of speech
- word pairs
- bilingual dictionaries
- wide coverage
- test collection
- broad coverage
- linguistic information
- comparable corpora
- n-gram
- spontaneous speech
- lexical features
- language model
- word level
- text classification
- computing semantic relatedness
- compound words
- keywords
- recognizing textual entailment
- semantic roles
- language independent
- query translation