hinglishNorm - A Corpus of Hindi-English Code Mixed Sentences for Text Normalization.
Piyush Makhija, Ankit Kumar, Anuj Gupta
Published in: CoRR (2020)
Keyphrases
- statistical machine translation
- machine translation system
- proper names
- training corpus
- noun phrases
- link grammar
- sentence level
- machine translation
- broad coverage
- source language
- multiword
- monolingual
- target language
- word sense
- text corpus
- lexical features
- linguistic features
- open domain
- language identification
- cross lingual
- english words
- indian languages
- syntactic analysis
- natural language
- penn treebank
- parallel corpus
- linguistic analysis
- comparable corpora
- word alignment
- linguistic patterns
- plain text
- parallel corpora
- query translation
- document level
- semantic roles
- contextual features
- sentence pairs
- english text
- natural language text
- tree bank
- word pairs
- semantic parsing
- text corpora
- text mining
- word level
- text summarization
- named entities
- arabic language
- syntactic features
- text classification
- text documents
- text to speech
- information extraction
- information retrieval
- natural language processing
- named entity recognition
- sentiment classification
- multi document summarization
- semantic relations
- unknown words
- co occurrence
- spoken language
- word order
- natural language generation
- syntactic categories
- automatic summarization
- question answering
- wordnet
- named entity recognizer