Documenting the English Colossal Clean Crawled Corpus.
Jesse Dodge, Maarten Sap, Ana Marasovic, William Agnew, Gabriel Ilharco, Dirk Groeneveld, Matt Gardner
Published in: CoRR (2021)
Keyphrases
- link grammar
- statistical machine translation
- person names
- open domain
- parallel corpus
- broad coverage
- wide coverage
- closed itemsets
- english words
- training corpus
- machine translation
- multiword
- penn treebank
- machine translation system
- natural language
- word sense
- association rule mining
- sentence pairs
- unknown words
- semantic roles
- information extraction
- cross lingual
- parallel corpora
- itemset mining
- association rules
- web pages
- cross language information retrieval