Taming the Data: Web-Scraping and De-Duplicating Messy Multilingual Philosophy Corpora.
Raluca A. Tanasescu, Cristian A. Marocico
Published in: DH (2020)
Keyphrases
- data sets
- database
- data analysis
- data collection
- web data
- raw data
- data structure
- data quality
- data sources
- text classification
- textual data
- high dimensional data
- information sources
- knowledge discovery
- end users
- input data
- data mining techniques
- data points
- spatial data
- web mining
- data objects
- digital libraries
- high quality
- training data
- data extraction
- unstructured information