Towards a Cleaner Document-Oriented Multilingual Crawled Corpus.
Julien Abadji, Pedro Javier Ortiz Suárez, Laurent Romary, Benoît Sagot
Published in: LREC (2022)
Keyphrases
- document corpus
- multilingual information retrieval
- document images
- multilingual documents
- information retrieval systems
- information retrieval
- document level
- scientific papers
- co-occurrence
- document collections
- document clustering
- temporal expressions
- text corpus
- retrieval systems
- similar documents
- word co-occurrence
- word sense
- manually annotated
- text collections
- training corpus
- cross-language
- cross-lingual
- document classification
- language model
- keywords
- coreference resolution
- document analysis
- multiword
- parallel corpus
- keyword extraction
- relevant documents
- semantic information
- web documents