An Exploratory Investigation into Code License Infringements in Large Language Model Training Datasets.
Jonathan Katzy, Razvan Mihai Popescu, Arie van Deursen, Maliheh Izadi
Published in: CoRR (2024)
Keyphrases
- language model
- training dataset
- source code
- language modeling
- training data
- probabilistic model
- n-gram
- document retrieval
- speech recognition
- information retrieval
- retrieval model
- mixture model
- language modelling
- class labels
- query expansion
- training set
- context-sensitive
- test collection
- statistical language models
- ad hoc information retrieval
- query terms
- training samples
- translation model
- query-specific
- smoothing methods
- data sets
- semi-supervised learning
- automatic speech recognition
- decision trees
- document length
- trained classifiers