CodeBPE: Investigating Subtokenization Options for Large Language Model Pretraining on Source Code.
Nadezhda Chirkova, Sergey Troshin
Published in: ICLR (2023)
Keyphrases
- source code
- language model
- language modeling
- open source
- software systems
- document retrieval
- n gram
- probabilistic model
- information retrieval
- software projects
- query expansion
- speech recognition
- mixture model
- software maintenance
- test collection
- software evolution
- high level
- retrieval model
- vector space model
- free software
- document representation
- context sensitive
- machine learning
- relevance model
- object oriented
- word segmentation
- plagiarism detection
- smoothing methods
- program understanding
- text files