Accepting gen AI code seems to be potentially a bad idea #628
Description
Accepting gen AI code seems to be potentially a bad idea due to the plagiarism problem; see, for example:
https://lcamtuf.substack.com/p/large-language-models-and-plagiarism
Bard didn’t merely copy facts when composing its answer; it lifted a good chunk of the text wholesale — wording, parentheses, non-US units, and all.
https://dl.acm.org/doi/10.1145/3543507.3583199
Our results suggest that [...] three types of plagiarism widely exist in LMs beyond memorization, [...] Given that a majority of LMs’ training data is scraped from the Web without informing content owners, their reiteration of words, phrases, and even core ideas from training sets into generated texts has ethical implications. Their patterns are likely to exacerbate as both the size of LMs and their training data increase, [...] Plagiarized content can also contain individuals’ personal and sensitive information.
https://www.theatlantic.com/technology/2026/01/ai-memorization-research/685552/
Large language models don’t “learn”—they copy. [...] when prompted strategically by researchers, Claude delivered the near-complete text of Harry Potter and the Sorcerer’s Stone, The Great Gatsby, 1984, and Frankenstein, [...] This may be a massive legal liability for AI companies—one that could potentially cost the industry billions of dollars in copyright-infringement judgments, [...] It also contradicts the basic explanation given by the AI industry for how its technology works.
Copilot team apparently confirming they use GPL code for training:
I reached out to the team about this. Apparently all public GitHub code was used in training. We don't distinguish by license type.
Claude team apparently confirming they use GPL code for training:
The following sources of training data may contain personal data: [...] Publicly available information via the Internet
The available information on this topic suggests that, for all major code models, the training data is typically not fully licensed in a way that waives attribution, while reproduction of training data in model output appears to be the rule rather than the exception.
The natural conclusion therefore seems to be that gen AI code commits may typically violate the GPL and other licenses. Since I recently saw somebody use Claude for flatpak commits, I thought I would bring it up.
It also seems that the claimed productivity gains may be subjective and perhaps not there in the long term:
https://fortune.com/2025/07/20/ai-hampers-productivity-software-developers-productivity-study/
https://futurism.com/ai-coding-programmers-reality
https://www.zdnet.com/article/ai-failed-test-on-remote-freelance-jobs/
https://sebgnotes.com/blog/2025-01-30-the-hidden-cost-of-ai-assisted-development-skill-erosion/
And there are concerns that LLMs may lack the logical reasoning needed to deal with non-trivial code:
https://machinelearning.apple.com/research/illusion-of-thinking
PS: I'm not a lawyer, and this isn't legal advice. Check the sources for yourself to make up your own mind.