CCMatrix: Mining Billions of High-Quality Parallel Sentences on the WEB.
Holger Schwenk, Guillaume Wenzek, Sergey Edunov, Edouard Grave, Armand Joulin
Published in: CoRR (2019)
Keyphrases
- high quality
- web mining
- website
- clickstream data
- web logs
- semantic web
- traversal patterns
- web applications
- web usage
- parallel processing
- data mining
- web users
- web documents
- web resources
- text mining
- web data
- information sources
- web data mining
- web pages
- log analysis
- web access
- low quality
- high resolution
- pattern mining
- sequential patterns
- mining algorithm
- database
- image quality
- web technologies
- linked data
- knowledge discovery
- web usage mining
- natural language
- multi document summarization
- association rule mining
- itemsets
- keywords
- unstructured information
- huge data
- data mining techniques
- end users