CCMatrix: Mining Billions of High-Quality Parallel Sentences on the Web.
Holger Schwenk, Guillaume Wenzek, Sergey Edunov, Edouard Grave, Armand Joulin, Angela Fan
Published in: ACL/IJCNLP (1) (2021)
Keyphrases
- high quality
- web mining
- web usage
- web logs
- website
- web pages
- web applications
- clickstream data
- traversal patterns
- text mining
- data mining
- web data
- web content
- web access
- parallel processing
- web documents
- semantic web
- information sources
- web users
- log analysis
- pattern mining
- linked data
- mining algorithm
- end users
- data mining techniques
- knowledge discovery
- researchers and practitioners interested
- sequential patterns
- huge data
- natural language processing
- web search
- parallel computing
- low quality
- link analysis