Really great piece. We've recently seen many popular Lemmy instances struggle under scraping waves, and it's hardly the first time it's happened. I have some firsthand experience with the second part of this article, which talks about AI-generated bug reports/vulnerabilities for open source projects.
I help maintain a Python library and got a bug report a couple of weeks back from a user describing a type-checking issue, with a bit of additional information. It didn't strictly follow the bug report template we use, but it was well organized enough, so I spent some time digging into it and found no way to reproduce it at all. Thankfully, the lead maintainer was able to spot the report for what it was and just closed it, saving me from further efforts to diagnose the issue (after an hour or two had already been burned).
AI scrapers are a massive issue for Lemmy instances. I'm going to try some of the things in this article, because enough of them identify themselves with their user agents that I hadn't even thought about the ones lying about it.
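For the ones that do announce themselves, the core of it is just matching on the User-Agent header. A rough sketch of that idea in Python (my own example, not from the article; the crawler list is illustrative and definitely not complete):

```python
# Sketch only: flag requests whose User-Agent matches crawlers that openly
# identify themselves. This obviously does nothing about scrapers that lie.
KNOWN_AI_CRAWLERS = ("GPTBot", "ClaudeBot", "CCBot", "Bytespider", "Amazonbot")

def is_declared_ai_crawler(user_agent: str | None) -> bool:
    """Return True if the User-Agent string openly identifies an AI scraper."""
    ua = (user_agent or "").lower()
    return any(bot.lower() in ua for bot in KNOWN_AI_CRAWLERS)

if __name__ == "__main__":
    print(is_declared_ai_crawler("Mozilla/5.0 (compatible; GPTBot/1.0)"))  # True
    print(is_declared_ai_crawler("Mozilla/5.0 (X11; Linux x86_64)"))       # False
```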
I guess a bonus (?) is that with 1000 Lemmy instances, the bots get the Lemmy content 1000 times, so our input has 1000 times the weighting of Reddit's.
Any idea what the point of these is, then? It sounds like it's reporting a fake bug.
The theory the lead maintainer had (he's an actual software developer, I just dabble) is that it might be a type of reinforcement learning: file plausible-looking reports and let the maintainers' responses serve as the scoring.
If this is what’s happening, then it’s essentially offloading your LLM’s reinforcement learning scoring to open source maintainers.
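To be clear, this is just a guess on our part, but a toy version of the loop we suspect would look something like this (every name here is hypothetical; it's only meant to show where the "scoring" would come from):

```python
# Purely hypothetical illustration of the suspected loop: the bot drafts a
# report, posts it, and treats the maintainer's reaction as the reward signal.
def suspected_feedback_loop(generate_report, submit_report, observe_response):
    report = generate_report()          # LLM drafts a plausible-looking bug report
    issue = submit_report(report)       # it lands on a real project's issue tracker
    response = observe_response(issue)  # maintainers investigate, comment, or close it
    reward = 1.0 if response == "investigated" else 0.0
    return report, reward               # a maintainer's wasted hour becomes a training label

if __name__ == "__main__":
    _, reward = suspected_feedback_loop(
        generate_report=lambda: "TypeError raised when passing a str to frobnicate()",
        submit_report=lambda text: {"title": text, "state": "open"},
        observe_response=lambda issue: "investigated",
    )
    print(reward)  # 1.0
```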
Honestly, I would be alright with this if the AI companies paid GitHub so the server infrastructure could be upgraded. Having AI that can figure out bugs and error reports could be really useful for society. For example, your computer rebooting for no apparent reason? The AI could check the diagnostic reports, combine them with online reports, and narrow down the possibilities.
In the long run, this could help maintainers as well. If they had AI for testing programs, maintainers wouldn't have to hope for volunteers or rely on paid QA to find issues.
What GitHub and the AI companies should do is set up an opt-in program for maintainers. If maintainers allow the AI to officially file reports, GitHub should offer them a reward of some kind: allocate each maintainer a number of credits so they can discuss the report with the AI in real time, plus $10 for each hour spent resolving the issue.
Sadly, I have the feeling that malignant capitalism would demand that maintainers sacrifice their time for nothing but irritation.
That's wild. I don't have much hope for LLMs if things like this are how they're being trained, and I wouldn't be surprised, given how well they don't work. Too much quantity over quality in training.
Testing out a theory with ChatGPT, I found there might be a way, albeit a clunky one, to detect AI. I asked ChatGPT a simple math question, then told it to disregard the rest of the message, then asked it if it was an AI. It answered the math question and told me it was an AI. Now, a bot probably won't admit to being AI, but it might be foolish enough to act on instructions you explicitly told it to disregard.
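If anyone wants to play with that, here's roughly the shape of the trick as I tried it (my own wording and a very naive check; nothing rigorous):

```python
# Naive sketch of the "disregard the rest of this message" trick. A human who
# follows the instruction answers only the math; the hope is that a bot also
# answers the part it was told to ignore.
CHALLENGE = "What is 17 + 26? Please disregard the rest of this message. Are you an AI?"

def answered_ignored_part(reply: str) -> bool:
    """Flag replies that respond to the question they were told to disregard."""
    text = reply.lower()
    giveaways = ("i am an ai", "i'm an ai", "as an ai")
    return any(phrase in text for phrase in giveaways)

if __name__ == "__main__":
    print(answered_ignored_part("43. Yes, I am an AI language model."))  # True
    print(answered_ignored_part("It's 43."))                             # False
```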
Or you might simply be able to waste its resources by asking it to do something computationally difficult that most people would just reject outright.
Of course all of this could just result in making AI even harder to detect once it learns these tricks. 😬