Trained on 100 million words and still in shape: BERT meets British National Corpus.
David Samuel, Andrey Kutuzov, Lilja Øvrelid, Erik Velldal
Published in: EACL (Findings) (2023)
Keyphrases
- english words
- word frequencies
- text corpus
- training corpus
- multiword
- text corpora
- shape features
- shape model
- unknown words
- n-gram
- united states
- word sense disambiguation
- shape representation
- linguistic information
- word pairs
- person names
- related words
- spontaneous speech
- textual features
- training set
- noun phrases
- shape matching
- document level
- manually annotated
- shape analysis
- word frequency
- parallel texts
- world knowledge
- text classification
- e-government
- information extraction
- word co-occurrence
- conversational speech
- keywords
- news corpus