supermatt a day ago

What is the difference like with batching?

It seems all these tests run only a single prompt at a time, which is for the most part throttled by memory bandwidth (faster on the 3090) and clock speed (faster on the 5060).

The 3090 has almost 3x the cores of a 5060, so I’m guessing it will absolutely wipe the floor with the dual 5060 setup for batched inference, which is increasingly essential for agentic workflows and complex tool use.
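
As a rough roofline sketch of why the winner flips with batch size (the figures below are illustrative spec-sheet approximations, the 2 FLOPs per active parameter per token rule is the usual back-of-envelope, and attention/KV-cache costs are ignored):

    GB = 1e9

    def decode_tok_s(batch, model_bytes, mem_bw, peak_flops, active_params):
        # Batch-1 decode re-reads every weight byte per step, so it is
        # bandwidth-bound; batching amortizes those reads across tokens...
        mem_bound = batch * mem_bw / model_bytes
        # ...until the compute ceiling (~2 FLOPs per active parameter per
        # token) takes over, which is where core count starts to matter.
        compute_bound = peak_flops / (2 * active_params)
        return min(mem_bound, compute_bound)

    # Hypothetical 3090-class card running a ~20 GB (Q4) dense 32B model:
    for batch in (1, 4, 16, 64):
        tok_s = decode_tok_s(batch, 20 * GB, mem_bw=936 * GB,
                             peak_flops=140e12, active_params=32e9)
        print(f"batch={batch:2d}: ~{tok_s:,.0f} tok/s aggregate")

At batch 1 both cards sit well under their compute ceilings; the gap only opens up once the batch is large enough that the extra cores stop being wasted.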

omneity 20 hours ago

I am not entirely surprised by the relative equivalence for the sparse model: the combined bandwidth of 2x 5060 Ti ≃ 1x 3090. Multi-GPU setups carry inefficiencies that are close to negligible at the small active dimensions of a sparse model, which is why the dense 32B model performs significantly worse on the dual 5060 setup.

For reference, I am getting ~40 output tok/s on a 4090 (450W) with Qwen3 32B and a context window of 4096.
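
The arithmetic behind that, for what it's worth (spec-sheet bandwidths I haven't re-measured, and the ~20 GB Q4 weight size is my assumption):

    bw = {"5060 Ti": 448, "3090": 936, "4090": 1008}  # GB/s, spec sheets

    print(2 * bw["5060 Ti"])  # 896 GB/s combined, within ~5% of a 3090

    # Bandwidth-bound decode ceiling for a dense 32B model at ~Q4, where
    # every generated token re-reads ~20 GB of weights:
    print(bw["4090"] / 20)    # ~50 tok/s ceiling; ~40 measured is ~80% of it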

> Ultimately, as *the user's note* aptly put it, the decision largely boils down to how much context you anticipate using regularly.

Hah. (emphasis mine)

esafak 21 hours ago

Reading this gave me flashbacks to the 80s, when tinkerers tried to move utilities into the upper and extended memory areas to free up precious conventional memory, the 640KB we were told ought to have been "enough for anyone". All this because we were saddled with a 16-bit OS. This is not an LLM problem -- 32GB of memory is peanuts in 2025 -- this is an Intel and AMD problem.

  • zamadatix 20 hours ago

    As the article highlights, the problem is really twofold. You need enough VRAM to load the model at all, but there also needs to be enough bandwidth that accessing all of that memory is fast enough to be worthwhile. It'd be "easy" to slap 2 TB of "slow" DDR5 onto a GPU, but it wouldn't perform much better than a high-core-count CPU running LLMs with the same memory.
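
    A quick illustration of that ceiling (rough, unmeasured bandwidth figures; batch-1 decode re-reads every weight byte per token, so tok/s is capped by bandwidth, not capacity):

        GB = 1e9
        model_bytes = 20 * GB  # a ~32B dense model at Q4 quantization

        # tok/s ceiling = bandwidth / bytes read per token; total
        # capacity never enters the equation.
        for name, bw in [("GDDR6X (3090-class)", 936 * GB),
                         ("8-ch server DDR5", 307 * GB),
                         ("2-ch desktop DDR5", 90 * GB)]:
            print(f"{name:20s} ~{bw / model_bytes:4.1f} tok/s ceiling")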

Havoc a day ago

One substantial downside is other uses. E.g. I also use my desktop for gaming, and a 3090 beats a 5060 easily on that, by a sizable margin (~33% in some games).

Not sure I'd trade that for more LLM VRAM.