Deidentifying a Corpus of 100 Million Clinical Text Documents for Information Extraction: Lessons Learned.
Lakshmi Radhakrishnan, Gundolf Schenk, Kathlene Muenzen, Boris Oskotsky, Sharat Israni, Atul J. Butte
Published in: AMIA (2022)
Keyphrases
- lessons learned
- text documents
- information extraction
- information extraction systems
- text corpus
- text data
- text collections
- text mining
- text corpora
- linguistic patterns
- natural language text
- natural language processing
- named entities
- free text
- relation extraction
- news articles
- extraction patterns
- manually annotated
- information retrieval
- named entity recognition
- textual data
- case study
- document clustering
- clinical guidelines
- structured data
- document classification
- web documents
- machine learning
- question answering
- wordnet
- keywords
- text processing
- text classification
- training documents
- text representation
- real world
- bag of words
- text categorization
- data sets
- topic models
- image classification
- data analysis
- automatic text categorization