CCOHA: Clean Corpus of Historical American English.
Reem Alatrash, Dominik Schlechtweg, Jonas Kuhn, Sabine Schulte im Walde
Published in: LREC (2020)
Keyphrases
- link grammar
- person names
- broad coverage
- open domain
- statistical machine translation
- parallel corpus
- wide coverage
- english words
- training corpus
- multiword
- monolingual
- linguistic features
- machine translation
- cross lingual
- united states
- sentence pairs
- word sense
- english language
- penn treebank
- semantic roles
- machine translation system
- language learning
- unknown words
- pos tagging
- natural language
- lexical units
- comparable corpora
- chinese english
- cross language
- english text
- historical data
- parallel corpora
- named entities
- text classification
- answer questions
- cross language information retrieval
- part of speech
- query translation
- word pairs
- spontaneous speech
- historical documents
- test set
- tree bank
- co-occurrence