Towards a Cleaner Document-Oriented Multilingual Crawled Corpus.
Julien Abadji, Pedro Javier Ortiz Suárez, Laurent Romary, Benoît Sagot
Published in: CoRR (2022)
Keyphrases
- document corpus
- text corpus
- multilingual documents
- information retrieval
- document clustering
- multilingual information retrieval
- document level
- cross-language information retrieval
- parallel corpus
- document classification
- text documents
- digital libraries
- information retrieval systems
- retrieval systems
- document images
- similar documents
- document collections
- manually annotated
- noun phrases
- document retrieval
- document space
- comparable corpora
- semantic information
- cross-lingual
- cross-language
- relevant documents
- co-occurrence
- word co-occurrence
- machine translation
- scientific papers
- user queries
- language-independent
- word sense
- WordNet
- document representation
- vector space model
- training corpus
- historical documents
- document analysis
- text collections