Taxi1500: A Multilingual Dataset for Text Classification in 1500 Languages.
Chunlan Ma, Ayyoob Imani, Haotian Ye, Ehsaneddin Asgari, Hinrich Schütze
Published in: CoRR (2023)
Keyphrases
- text classification
- language independent
- cross lingual
- text classification tasks
- multi lingual
- bag of words
- n gram
- language specific
- cross lingual information retrieval
- text data
- language modeling
- feature selection
- text categorization
- multilingual information retrieval
- labeled data
- text mining
- machine learning
- text documents
- naive bayes
- multilingual documents
- cross language
- language resources
- benchmark datasets
- parallel corpora
- gps data
- text classifiers
- multilingual retrieval
- database
- knn
- data cleaning
- multi label
- expressive power
- data analysis
- digital libraries
- dublin city university