CodeBPE: Investigating Subtokenization Options for Large Language Model Pretraining on Source Code.

Nadezhda Chirkova, Sergey Troshin

Published in: ICLR (2023)

Keyphrases

source code
language model
language modeling
open source
software systems
document retrieval
n-gram
probabilistic model
information retrieval
software projects
query expansion
speech recognition
mixture model
software maintenance
test collection
software evolution
high level
retrieval model
vector space model
free software
document representation
context-sensitive
machine learning
relevance model
object-oriented
word segmentation
plagiarism detection
smoothing methods
program understanding
text files