An Exploratory Investigation into Code License Infringements in Large Language Model Training Datasets.
Jonathan Katzy, Razvan Mihai Popescu, Arie van Deursen, Maliheh Izadi
Published in: FORGE (2024)
Keyphrases
- language model
- training dataset
- source code
- language modeling
- speech recognition
- training data
- n gram
- query expansion
- document retrieval
- probabilistic model
- information retrieval
- training set
- language modelling
- retrieval model
- statistical language models
- ad hoc information retrieval
- test collection
- mixture model
- pseudo relevance feedback
- training samples
- context sensitive
- machine learning
- word clouds
- translation model
- class labels
- high dimensional
- support vector