Balancing SoNaR: IPR versus Processing Issues in a 500-Million-Word Written Dutch Reference Corpus.
Martin Reynaert, Nelleke Oostdijk, Orphée De Clercq, Henk van den Heuvel, Franciska de Jong
Published in: LREC (2010)
Keyphrases
- word frequencies
- news corpus
- text corpus
- mobile robot
- multiword
- English words
- real time
- co-occurrence
- real world
- unknown words
- sentence level
- word recognition
- linguistic information
- scalability issues
- parallel corpus
- stop words
- writing style
- noun phrases
- spontaneous speech
- lexical features
- n-gram
- statistical machine translation
- word sense
- natural language text
- conversational speech
- keywords