EMILLE, A 67-Million Word Corpus of Indic Languages: Data Collection, Mark-up and Harmonisation.
Paul Baker, Andrew Hardie, Tony McEnery, Hamish Cunningham, Robert J. Gaizauskas
Published in: LREC (2002)
Keyphrases
- data collection
- statistical machine translation
- sentence pairs
- word recognition
- word frequencies
- news corpus
- machine translation system
- word pairs
- training corpus
- parallel corpus
- text corpus
- cross-lingual
- multiword
- grammar induction
- English words
- word sense
- parallel corpora
- data analysis
- target language
- natural language text
- machine translation
- sentence level
- language-independent
- word order
- linguistic information
- sensor networks
- translation model
- unknown words
- comparable corpora
- noun phrases
- n-gram
- lexical features
- co-occurrence
- language model
- word frequency
- English text
- compound words
- bilingual dictionaries
- word level
- source language
- word co-occurrence
- POS taggers
- word segmentation
- WordNet
- writing style
- word sense disambiguation
- wireless sensor networks
- data entry
- character n-grams
- language identification
- ambiguous words
- data mining