Improving corpus reproducibility through modular text transformations and connected data set.
Jonathan Pulliza, Chirag Shah
Published in: ASIST (2018)
Keyphrases
- data sets
- text data
- open domain
- broad coverage
- supervised machine learning
- plain text
- newspaper articles
- text corpus
- lexical features
- sentence level
- text corpora
- text retrieval
- natural language text
- recognizing textual entailment
- real world
- document corpus
- free text
- anaphora resolution
- spontaneous speech
- text collections
- linguistic patterns
- database
- information retrieval
- text processing
- entity extraction
- information extraction systems
- text documents
- text mining
- linguistic information
- training data
- named entity disambiguation
- scientific papers
- training corpus
- document level
- word pairs
- multiword
- word sense
- noun phrases
- connected components
- test set
- keywords
- world knowledge
- conversational speech
- training set
- text classification
- web documents
- news articles
- topic segmentation
- topic tracking
- textual features