A screenshot of this question was making the rounds last week, but this article covers testing it against all the well-known models out there.

It also includes outtakes from the ‘reasoning’ models.

  • melfie@lemy.lol · 1 day ago

    Context engineering is one way to shift that balance. When you provide a model with structured examples, domain patterns, and relevant context at inference time, you give it information that can help override generic heuristics with task-specific reasoning.

    So the chatbots that get it right consistently probably have it in their system prompt temporarily, until they can be retrained with it incorporated into the training data. 😆
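    To be fair, the technique the article describes is basically few-shot prompting. A rough sketch of what injecting worked examples into a system prompt at inference time looks like (the task, examples, and function names here are all made up for illustration):

    ```python
    # Hypothetical sketch of "context engineering" as few-shot prompting.
    # Task-specific demonstrations are pasted into the prompt at inference
    # time rather than baked into the model's weights.

    FEW_SHOT_EXAMPLES = [
        ("How many 'r's are in strawberry?", "3"),
        ("How many 'e's are in excellence?", "4"),
    ]

    def build_messages(question: str) -> list[dict]:
        """Assemble a chat payload with worked examples in the system prompt."""
        demos = "\n".join(f"Q: {q}\nA: {a}" for q, a in FEW_SHOT_EXAMPLES)
        system = (
            "Answer letter-counting questions. Count character by character.\n"
            "Worked examples:\n" + demos
        )
        return [
            {"role": "system", "content": system},
            {"role": "user", "content": question},
        ]

    # The assembled payload would then go to whatever chat API you use.
    print(build_messages("How many 'b's are in blackberry?"))
    ```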

    Edit:

    Oh, I see the linked article is part of a marketing campaign to promote this company’s paid cloud service (with source-available SDKs) as a solution to the problem being outlined here:

    Opper automatically finds the most relevant examples from your dataset for each new task. The right context, every time, without manual selection.

    I can see where this approach might be helpful, but why is it necessary to pay them per API call instead of using an open-source solution that runs locally (aside from the fact that it’s better for their monetization this way)? Good chance they’re running it through yet another LLM and charging API fees to cover their inference costs plus a profit margin. What happens when that LLM returns the wrong example?
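    For what it’s worth, the “find the most relevant examples” step can be done locally with boring, free tooling. A minimal sketch using TF-IDF from scikit-learn (the dataset contents are invented; a local embedding model would slot in the same way):

    ```python
    # Local example retrieval: pick the k dataset entries most similar
    # to a new task, with no per-call API fees involved.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    dataset = [
        "Q: How many 'r's are in strawberry? A: 3",
        "Q: What is the capital of France? A: Paris",
        "Q: How many 'e's are in excellence? A: 4",
    ]

    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(dataset)

    def top_k_examples(task: str, k: int = 2) -> list[str]:
        """Return the k dataset entries most similar to the new task."""
        task_vec = vectorizer.transform([task])
        scores = cosine_similarity(task_vec, matrix)[0]
        return [dataset[i] for i in scores.argsort()[::-1][:k]]

    print(top_k_examples("How many 'b's are in blackberry?"))
    ```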

    • Schadrach@lemmy.sdf.org · 23 hours ago

      There are models with open weights, and you can run those locally on your GPU, though it can be a bit slower depending on the model and your hardware. For example, GLM has an open version, both full and pruned, but it’s not the newest release. A bunch of image-generation models have local versions too.
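      To make that concrete, here’s a minimal sketch of loading an open-weight checkpoint locally with Hugging Face transformers (the model id is a placeholder, and device_map="auto" assumes the accelerate package is installed):

      ```python
      # Run an open-weight model locally; substitute any checkpoint
      # that fits in your GPU's memory for the placeholder id.
      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer

      MODEL_ID = "some-org/some-open-weight-model"  # placeholder, not a real repo

      tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
      model = AutoModelForCausalLM.from_pretrained(
          MODEL_ID,
          torch_dtype=torch.float16,  # halves memory vs. float32
          device_map="auto",          # places layers on available devices (needs accelerate)
      )

      prompt = "How many 'r's are in strawberry?"
      inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
      outputs = model.generate(**inputs, max_new_tokens=32)
      print(tokenizer.decode(outputs[0], skip_special_tokens=True))
      ```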