CodeBPE: Investigating Subtokenization Options for Large Language Model Pretraining on Source Code.
Nadezhda Chirkova, Sergey Troshin
Published in: CoRR (2023)
Keyphrases
- source code
- language model
- language modeling
- open source
- software systems
- document retrieval
- probabilistic model
- n-gram
- speech recognition
- retrieval model
- information retrieval
- query expansion
- test collection
- software projects
- software repositories
- software maintenance
- context sensitive
- ad hoc information retrieval
- smoothing methods
- relevance model
- document representation
- query terms
- mixture model
- translation model
- plagiarism detection
- text files
- free software
- case study
- software evolution
- program understanding