Building pre-train LLM Dataset for the INDIC Languages: a case study on Hindi.
Shantipriya Parida, Shakshi Panwar, Kusum Lata, Sanskruti Mishra, Sambit Sekhar
Published in: CoRR (2024)
Keyphrases
- language identification
- indian languages
- spoken language
- statistical machine translation
- cross lingual
- case study
- machine translation
- language independent
- benchmark datasets
- databases
- word order
- target language
- named entity recognition
- test bed
- expressive power
- comparable corpora
- cross language information retrieval
- linguistic resources
- multi lingual
- xml documents
- feature selection