A hybrid pipeline of rules and machine learning to filter web-crawled parallel corpora.
Eduard Barbu, Verginica Barbu Mititelu
Published in: WMT (shared task) (2018)
Keyphrases
- machine learning
- parallel corpora
- web pages
- cross language information retrieval
- web documents
- feature selection
- active learning
- information extraction
- web mining
- machine translation
- information sources
- query translation
- web data
- user experience
- text mining
- natural language processing
- data mining
- word pairs
- language independent
- information retrieval
- user generated content
- cross lingual
- text classification
- knowledge discovery
- artificial intelligence