Accepting gen AI code seems to be potentially a bad idea #628
Description
Accepting gen AI code seems to be potentially a bad idea due to the plagiarism problem; see, for example:
https://lcamtuf.substack.com/p/large-language-models-and-plagiarism
Bard didn’t merely copy facts when composing its answer; it lifted a good chunk of the text wholesale — wording, parentheses, non-US units, and all.
https://dl.acm.org/doi/10.1145/3543507.3583199
Our results suggest that [...] three types of plagiarism widely exist in LMs beyond memorization, [...] Given that a majority of LMs’ training data is scraped from the Web without informing content owners, their reiteration of words, phrases, and even core ideas from training sets into generated texts has ethical implications. Their patterns are likely to exacerbate as both the size of LMs and their training data increase, [...] Plagiarized content can also contain individuals’ personal and sensitive information.
https://www.theatlantic.com/technology/2026/01/ai-memorization-research/685552/
Large language models don’t “learn”—they copy. [...] when prompted strategically by researchers, Claude delivered the near-complete text of Harry Potter and the Sorcerer’s Stone, The Great Gatsby, 1984, and Frankenstein, [...] This may be a massive legal liability for AI companies—one that could potentially cost the industry billions of dollars in copyright-infringement judgments, [...] It also contradicts the basic explanation given by the AI industry for how its technology works.
Copilot team apparently confirming they use GPL code for training:
I reached out to the team about this. Apparently all public GitHub code was used in training. We don't distinguish by license type.
Claude team apparently confirming they use GPL code for training:
The following sources of training data may contain personal data: [...] Publicly available information via the Internet
The available information on this topic suggests that, for all major code models, the training data is typically not fully licensed in a way that waives attribution, while reproduction of training data in model output appears to be the rule rather than the exception.
The natural conclusion therefore seems to be that gen AI code commits may typically violate the GPL and other licenses. Since I recently saw somebody use Claude for flatpak commits, I thought I would bring it up.
It also seems that the claimed productivity gains may be subjective and perhaps not there in the long term:
https://fortune.com/2025/07/20/ai-hampers-productivity-software-developers-productivity-study/
https://futurism.com/ai-coding-programmers-reality
https://www.zdnet.com/article/ai-failed-test-on-remote-freelance-jobs/
https://sebgnotes.com/blog/2025-01-30-the-hidden-cost-of-ai-assisted-development-skill-erosion/
And there are concerns that LLMs may lack the logical reasoning needed to deal with non-trivial code:
https://machinelearning.apple.com/research/illusion-of-thinking
PS: I'm not a lawyer, and this isn't legal advice. Check the sources for yourself to make up your own mind.