The (ab)use of Open Source Code to Train Large Language Models.
Ali Al-Kaswan, Maliheh Izadi
Published in: NLBSE@ICSE (2023)
Keyphrases
- source code
- language model
- language modeling
- open source
- software systems
- information retrieval
- document retrieval
- speech recognition
- n gram
- open source software
- probabilistic model
- software projects
- query expansion
- statistical language models
- test collection
- retrieval model
- software maintenance
- language modelling
- plagiarism detection
- text files
- free software
- smoothing methods
- language models for information retrieval
- vector space model
- document ranking
- document representation
- relevance model
- software evolution
- software repositories
- program understanding
- cross lingual
- high level
- source files
- search engine