Advice - Getting started with LLMs

its_me_xiphos@beehaw.org · 2 years ago

BaroqueInMind@lemmy.one · 2 years ago

Ollama is so fucking slow. Even with a 16-core overclocked Intel CPU, 64 GB of RAM, and an Nvidia 3080 with 10 GB of VRAM, generating a simple haiku with a 22B-parameter model takes 20 minutes.

xcjs@programming.dev · 2 years ago

No offense intended, but are you sure it’s using your GPU? Twenty minutes is about how long my CPU-locked instance takes to run some 70B parameter models.

On my RTX 3060, I generally get responses in seconds.
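One rough way to check whether the GPU is actually being used: watch VRAM usage while the model is loaded. The sketch below is not Ollama-specific; it just shells out to `nvidia-smi` and assumes an Nvidia card whose driver reports plain integer MiB values. The helper names `parse_gpu_memory` and `query_gpu_memory` are made up for illustration. If VRAM usage barely moves while a model is loaded, it is probably running on the CPU. Running `ollama ps` should also show the CPU/GPU split for a loaded model.

```python
import shutil
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'used, total' MiB pairs, one line per GPU (hypothetical helper)."""
    gpus = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        gpus.append((used, total))
    return gpus


def query_gpu_memory():
    """Return [(used_mib, total_mib), ...], or None if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        result = subprocess.run(
            [
                "nvidia-smi",
                "--query-gpu=memory.used,memory.total",
                "--format=csv,noheader,nounits",
            ],
            capture_output=True,
            text=True,
            check=True,
        )
    except subprocess.CalledProcessError:
        # nvidia-smi exists but could not talk to the driver.
        return None
    return parse_gpu_memory(result.stdout)


if __name__ == "__main__":
    gpus = query_gpu_memory()
    if gpus is None:
        print("nvidia-smi not available; cannot check VRAM usage")
    else:
        for index, (used, total) in enumerate(gpus):
            print(f"GPU {index}: {used} MiB used of {total} MiB")
```

Run it once before loading the model and once while it is generating, then compare the two readings.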

kiku123@feddit.de · 2 years ago

I agree. My 3070 responds with the Llama 3 8B model in about 250 ms, especially for short responses.

Zworf@beehaw.org · 2 years ago

Hmmm, weird. I have a 4090, a Ryzen 5800X3D, and 64 GB of RAM, and it runs really well. Admittedly it’s the 8B model, because the intermediate sizes aren’t out yet and 70B simply won’t fly on a single GPU.

But it really screams. Much faster than I can read. PS: Ollama is just llama.cpp under the hood.

Edit: Ah, wait, I know what’s going wrong here. The 22B parameter model is probably too big for your VRAM. Whatever doesn’t fit on the GPU has to run on the CPU from system RAM, and then it gets extremely slow, yes.
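A back-of-the-envelope way to see why 22B is a squeeze on 10 GB: the weights alone take roughly parameters × bits per weight ÷ 8 bytes. The sketch below assumes a 4-bit quantization and a flat 1.5 GB allowance for context and runtime buffers. Both numbers are assumptions for illustration, not measured values, and real quantized files often use somewhat more than 4 bits per weight. The function names are made up too.

```python
def estimate_vram_gb(params_billion, bits_per_weight=4.0, overhead_gb=1.5):
    """Rough VRAM estimate in GB: weight size plus an assumed fixed overhead.

    bits_per_weight and overhead_gb are illustrative assumptions.
    """
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb


def fits_in_vram(params_billion, vram_gb, bits_per_weight=4.0, overhead_gb=1.5):
    """True if the rough estimate fits entirely in the given VRAM."""
    return estimate_vram_gb(params_billion, bits_per_weight, overhead_gb) <= vram_gb


if __name__ == "__main__":
    for size in (8, 22, 70):
        needed = estimate_vram_gb(size)
        verdict = "fits" if fits_in_vram(size, 10) else "does not fit"
        print(f"{size}B at 4-bit: ~{needed:.1f} GB, {verdict} in 10 GB")
```

Under these assumptions, a 22B model needs about 11 GB for the weights alone, already over a 10 GB card, while an 8B model needs about 4 GB and leaves plenty of headroom.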

BaroqueInMind@lemmy.one · 2 years ago

What is the appropriate size for 10 GB of VRAM?