CodeBPE: Investigating Subtokenization Options for Large Language Model Pretraining on Source Code.

Nadezhda Chirkova, Sergey Troshin

Published in: CoRR (2023)

Keyphrases

source code
language model
language modeling
open source
software systems
document retrieval
probabilistic model
n-gram
speech recognition
retrieval model
information retrieval
query expansion
test collection
software projects
software repositories
software maintenance
context-sensitive
ad hoc information retrieval
smoothing methods
relevance model
document representation
query terms
mixture model
translation model
plagiarism detection
text files
free software
case study
software evolution
program understanding