Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus.
Jesse Dodge, Maarten Sap, Ana Marasovic, William Agnew, Gabriel Ilharco, Dirk Groeneveld, Margaret Mitchell, Matt Gardner
Published in: EMNLP (1) (2021)
Keyphrases
- text corpora
- annotated corpus
- closed itemsets
- statistical machine translation
- parallel corpus
- wide coverage
- training corpus
- document corpus
- association rule mining
- topic segmentation
- text data
- linguistic patterns
- manually annotated
- text corpus
- parallel corpora
- word frequency
- sentence pairs
- natural language processing
- hand crafted
- specific domains
- Chinese-English
- data analysis
- frequent itemsets
- named entities
- association rules