The (Ab)use of Open Source Code to Train Large Language Models.
Ali Al-Kaswan, Maliheh Izadi
Published in: CoRR (2023)
Keyphrases
- source code
- language model
- open source
- language modeling
- software systems
- language modelling
- n-gram
- information retrieval
- speech recognition
- probabilistic model
- statistical language models
- document retrieval
- query expansion
- software maintenance
- retrieval model
- software projects
- open source software
- test collection
- software repositories
- high level
- plagiarism detection
- smoothing methods
- legacy software
- language models for information retrieval
- vector space model
- program understanding
- software evolution
- document ranking
- manual inspection
- relevance model
- statistical language modeling
- case study