Building pre-train LLM Dataset for the INDIC Languages: a case study on Hindi.

Shantipriya Parida, Shakshi Panwar, Kusum Lata, Sanskruti Mishra, Sambit Sekhar

Published in: CoRR (2024)

Keyphrases

language identification
Indian languages
spoken language
statistical machine translation
cross lingual
case study
machine translation
language independent
benchmark datasets
databases
word order
target language
named entity recognition
test bed
expressive power
comparable corpora
cross language information retrieval
linguistic resources
multi lingual
xml documents
feature selection