A Measure for Transparent Comparison of Linguistic Diversity in Multilingual NLP Data Sets.
Tanja Samardzic, Ximena Gutierrez-Vasques, Christian Bentz, Steven Moran, Olga Pelloni
Published in: CoRR (2024)
Keyphrases
- data sets
- natural language processing
- natural language
- diversity measures
- linguistic knowledge
- similarity measure
- linguistic analysis
- database
- text generation
- text mining
- data streams
- knowledge representation
- information extraction
- real world
- lexical semantics
- synthetic data
- correlation coefficient
- benchmark data sets
- training data
- digital libraries
- hand crafted
- part of speech
- cross lingual
- free text
- question answering
- wordnet
- distance measure