OBELISC: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents.
Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, Victor Sanh
Published in: CoRR (2023)
Keyphrases
- web scale
- million images
- text documents
- web images
- image data
- image content
- image search
- text classification
- text categorization
- image representation
- bag of words
- image features
- text mining
- image classification
- wordnet
- image collections
- keywords
- topic models
- information extraction
- multiscale
- image segmentation
- visual features
- image retrieval
- machine learning
- image regions
- high dimensional
- information retrieval