Zipporah: a Fast and Scalable Data Cleaning System for Noisy Web-Crawled Parallel Corpora.
Hainan Xu, Philipp Koehn
Published in: EMNLP (2017)
Keyphrases
- data cleaning
- parallel corpora
- data integration
- web pages
- website
- record linkage
- data quality
- text classification
- web usage mining
- database
- outlier detection
- linked data
- machine translation
- web users
- web mining
- language independent
- information sources
- web content
- web documents
- web data
- cross lingual
- missing values
- cross language information retrieval
- information extraction
- information retrieval
- user generated content
- data model
- machine translation system
- databases