  • theluddite@lemmy.ml · 1 year ago

    “I gave an LLM a wildly oversimplified version of a complex human task and it did pretty well”

    For how long will we be forced to endure different versions of the same article?

    The study said 86.66% of the generated software systems were “executed flawlessly.”

    Like I said yesterday, in a post celebrating how ChatGPT can answer medical questions with less than 80% accuracy: that is trash. A company with absolute shit code still has virtually all of it “execute flawlessly.” Whether or not code executes is not the bar by which we judge it.
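
    To be concrete (a toy example of my own, nothing to do with the study's actual tasks): the Python below runs without raising a single error, so by that measure it “executed flawlessly,” and it is still simply wrong.

        # Hypothetical illustration, not code from the study: it runs cleanly
        # ("executes flawlessly") but gives the wrong answer.

        def median(values):
            values = sorted(values)
            return values[len(values) // 2]   # bug: wrong for even-length lists

        print(median([1, 2, 3, 4]))  # runs fine, prints 3; the real median is 2.5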

    Even if it were to hit 100%, which it does not, there’s so much more to making things than this obviously oversimplified simulation of a tech company. Real engineering involves getting people in a room, managing stakeholders, navigating conflicting desires from different stakeholders, getting to know the human beings who need a problem solved, and so on.

    LLMs are not capable of this kind of meaningful collaboration, despite all this hype.

    • PlexSheep@feddit.de · 1 year ago

      Thank you for writing this so I only have to upvore upvote you.

      Edit: What a difference one key can make

  • blazera@kbin.social · 1 year ago

    Researchers, for example, tasked ChatDev to “design a basic Gomoku game,” an abstract strategy board game also known as “Five in a Row.”

    What tech company is making Connect Four as their business model?

    • realharo@lemm.ee · 1 year ago

      This is also the kind of task you would expect it to be great at - a tutorial-friendly project for which there are tons of examples and articles written online that guide the reader from start to finish.

      The kind of thing you would get a YouTube tutorial for in 2016 with a title like “make [thing] in 10 minutes!” (see https://www.google.com/search?q=flappy+bird+in+10+minutes).

      Other things like that include TODO lists (which are even used as a task for framework comparisons), tile-based platformer games, Wordle clones, Flappy Bird clones, chess (including online play and basic bots), URL shorteners, Twitter clones, blogging CMSs, recipe books, and other basic CRUD apps.

      I wasn’t able to find a list of tasks in the linked paper, but based on the Gomoku one, I suspect a lot of it will be things like these.
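
      For a sense of scale (a rough sketch of my own, not code from the paper): the only genuinely game-specific logic in Gomoku, checking for five in a row, fits in a dozen lines of Python, which is exactly why it makes such a popular tutorial project.

          # Rough sketch, not the paper's code: the core rule of Gomoku / Five in a Row.
          # `board` maps (row, col) -> "X" or "O" for occupied squares.

          def wins(board, row, col, player):
              # Check the four line directions through the stone just placed.
              for dr, dc in [(0, 1), (1, 0), (1, 1), (1, -1)]:
                  count = 1
                  for sign in (1, -1):  # walk outward both ways along the line
                      r, c = row + sign * dr, col + sign * dc
                      while board.get((r, c)) == player:
                          count += 1
                          r, c = r + sign * dr, c + sign * dc
                  if count >= 5:
                      return True
              return False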

  • kitonthenet@kbin.social · 1 year ago

    At the designing stage, the CEO asked the CTO to “propose a concrete programming language” that would “satisfy the new user’s demand,” to which the CTO responded with Python. In turn, the CEO said, “Great!” and explained that the programming language’s “simplicity and readability make it a popular choice for beginners and experienced developers alike.”

    I find it extremely funny that project managers are the ones chatbots have learned to imitate perfectly; they were already doing the robot’s work: saying impressive-sounding things that are actually borderline gibberish.

    • realharo@lemm.ee · 1 year ago

      And ironically, Python (with Pygame, which they also used) is a terrible choice for this kind of game - they ended up making a desktop game that the user would have to download. Not playable on the web, not usable for a mobile app.

      More interestingly, if decisions like these are going to be made even more based on memes and random blogposts, that creates some worrying incentives for even more spambots. Influence the training data, and you’re influencing the decision making. It kind of works like that for people too, but with AI, it’s supercharged to the next level.

    • thanks_shakey_snake@lemmy.ca · 1 year ago

      What does it even mean for a programming language to “satisfy the new user’s demand?” Like when has the user ever cared whether your app is built in Python or Ruby or Common Lisp?

      It’s like “what notebook do I need to buy to pass my exams,” or “what kind of car do I need to make sure I get to work on time?”

      Yet I’m 100% certain that real human executives have had equivalent conversations.

  • atzanteol@sh.itjust.works · 1 year ago

    This research seems to be focused more on whether the bots would cooperate in different roles to coordinate on a task than on creating the actual software. The idea is to reduce “hallucinations” by giving each bot a more specific task.

    The paper goes into more detail about this:

    Similar to hallucinations encountered when using LLMs for natural language querying, directly generating entire software systems using LLMs can result in severe code hallucinations, such as incomplete implementation, missing dependencies, and undiscovered bugs. These hallucinations may stem from the lack of specificity in the task and the absence of cross-examination in decision-making. To address these limitations, as Figure 1 shows, we establish a virtual chat-powered software technology company – CHATDEV, which comprises of recruited agents from diverse social identities, such as chief officers, professional programmers, test engineers, and art designers. When presented with a task, the diverse agents at CHATDEV collaborate to develop a required software, including an executable system, environmental guidelines, and user manuals. This paradigm revolves around leveraging large language models as the core thinking component, enabling the agents to simulate the entire software development process, circumventing the need for additional model training and mitigating undesirable code hallucinations to some extent.
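
    Stripped down, that “chat chain” amounts to role-scoped prompts passed along in sequence. Roughly something like this sketch (my own paraphrase, not ChatDev’s code, with a hypothetical ask_llm() standing in for whatever chat-completion call you’d actually use):

        # Paraphrase of the role-decomposition idea; ask_llm() is a hypothetical stub.

        ROLES = {
            "CTO":        "Propose a language and high-level design for: {task}",
            "Programmer": "Write code implementing this design:\n{prev}",
            "Tester":     "Read this code and report concrete bugs:\n{prev}",
            "Reviewer":   "Apply these bug reports and return the fixed code:\n{prev}",
        }

        def ask_llm(role, prompt):
            # Stand-in only: swap in a real chat-completion call here.
            return f"[{role}'s answer to: {prompt[:60]}...]"

        def chat_chain(task):
            artifact = task
            for role, template in ROLES.items():
                # Each agent sees only a narrow, role-specific prompt; that is the
                # paper's argument for why code hallucinations shrink "to some extent."
                artifact = ask_llm(role, template.format(task=task, prev=artifact))
            return artifact

        print(chat_chain("design a basic Gomoku game"))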

  • gencha@feddit.de · 1 year ago

    What a load of bullshit. If you have a group of researchers provide “minimal human input” to a bunch of LLMs to produce a laughable program like tic-tac-toe, then please just STFU or at least don’t tell us it cost $1. This doesn’t even have the efficiency of a Google search. This AI hype needs to die quick

  • m_r_butts@kbin.social · 1 year ago

    Every company I’ve been at follows this cycle: offshore to Cognizant for pennies, C-suite gets a bonus for saving money. In about two years, fire Cognizant because they suck and your code is a disaster, onshore, get a bonus for solving a huge problem. In about two years, offshore to Cognizant and get a bonus for saving money. Repeat forever.

    This will follow the same rhythm but with different actors: the cheap labor is always there, and sometimes senior devs come in to replace the chatbots because the bots are failing in ways offshore can’t make up for: either fundamental design problems that shouldn’t have been used as a roadmap, or incompetently generated code that offshore assumes is correct because it compiles. This will all get built up and built around until it’s both a broken design AND deeply embedded in your stack. The new role of a senior dev will be contract work slicing these Gordian knots.

  • Knusper@feddit.de · 1 year ago

    the CTO responded with Python. In turn, the CEO said, “Great!” and explained that the programming language’s “simplicity and readability make it a popular choice for beginners and experienced developers alike.”

    Yep, that does sound like my CEO.

  • AutoTL;DR@lemmings.world (bot) · 1 year ago

    This is the best summary I could come up with:


    AI chatbots like OpenAI’s ChatGPT can operate a software company in a quick, cost-effective manner with minimal human intervention, a new study has found.

    Based on the waterfall model — a sequential approach to creating software — the company was broken down into four different stages, in chronological order: designing, coding, testing, and documenting.

    After assigning ChatDev 70 different tasks, the study found that the AI-powered company was able to complete the full software development process “in under seven minutes at a cost of less than one dollar,” on average — all while identifying and troubleshooting “potential vulnerabilities” through its “memory” and “self-reflection” capabilities.

    “Our experimental results demonstrate the efficiency and cost-effectiveness of the automated software development process driven by CHATDEV,” the researchers wrote in the paper.

    The study’s findings highlight one of the many ways powerful generative AI technologies like ChatGPT can perform specific job functions.

    Nevertheless, the study isn’t perfect: Researchers identified limitations, such as errors and biases in the language models, that could cause issues in the creation of software.


    The original article contains 639 words, the summary contains 172 words. Saved 73%. I’m a bot and I’m open source!