Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets.
Isaac Caswell, Julia Kreutzer, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, Monang Setyawan, Supheakmungkol Sarin, Sokhar Samb, Benoît Sagot, Clara Rivera, Annette Rios, Isabel Papadimitriou, Salomey Osei, Pedro Javier Ortiz Suárez, Iroro Orife, Kelechi Ogueji, Rubungo Andre Niyongabo, Toan Q. Nguyen, Mathias Müller, André Müller, Shamsuddeen Hassan Muhammad, Nanda Muhammad, Ayanda Mnyakeni, Jamshidbek Mirzakhalov, Tapiwanashe Matangira, Colin Leong, Nze Lawson, Sneha Kudugunta, Yacine Jernite, Mathias Jenny, Orhan Firat, Bonaventure F. P. Dossou, Sakhile Dlamini, Nisansa de Silva, Sakine Çabuk Balli, Stella Biderman, Alessia Battisti, Ahmed Baruwa, Ankur Bapna, Pallavi Baljekar, Israel Abebe Azime, Ayodele Awokoya, Duygu Ataman, Orevaoghene Ahia, Oghenefego Ahia, Sweta Agrawal, Mofetoluwa Adeyemi
Published in: CoRR (2021)
Keyphrases
- web pages
- website
- high quality
- web applications
- multilingual documents
- database
- quality assurance
- multi lingual
- semantic web
- information sources
- web content
- intrusion detection
- web documents
- web mining
- web crawling
- digital libraries
- linked data
- web data
- information access
- data sets
- log analysis
- data quality
- quality assessment
- text categorization
- text classification
- data mining techniques
- information retrieval