

I use open source 32b Chinese models almost exclusively, because I can run them on my own machine without being a data cow for the US tech oligarchs or the CCP.
I only use the larger models for little hobby projects, and I don’t care too much about who gets that data. But if I wanted to use the large models for something sensitive, the open source Chinese models are the more secure option IMO. Rather than get a “trust me bro” pinky promise from Closed AI or Anthropic, I can run Qwen or Kimi on a cloud GPU provider that offers raw compute by the hour without any data harvesting.
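Concretely, "raw compute by the hour" just means you stand the server up yourself on a rented GPU box and point a standard client at your own endpoint. Here's roughly what that looks like in Python – the hostname and model tag are placeholders, and I'm assuming an OpenAI-compatible server (e.g. something like `vllm serve Qwen/Qwen2.5-32B-Instruct`) running on the box:

    # Rough sketch: talk to a model you host yourself on rented GPU compute.
    # "your-gpu-box.example.com" and the model tag are placeholders.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://your-gpu-box.example.com:8000/v1",  # your own endpoint, not a hosted API
        api_key="not-needed-for-self-hosted",  # client requires a string; a self-hosted server typically doesn't check it
    )

    resp = client.chat.completions.create(
        model="Qwen/Qwen2.5-32B-Instruct",
        messages=[{"role": "user", "content": "Summarize this contract clause..."}],
    )
    print(resp.choices[0].message.content)

The point is that nothing sensitive leaves infrastructure you control beyond the bare VM, and you can tear the box down when you're done.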
Sorry for the slow reply, but I’ll piggyback on this thread to say that I tend to target models a little bit smaller than my total VRAM to leave room for a larger context window – without any offloading to RAM.
As an example, with 24 GB of VRAM (an Nvidia 4090) I can typically run a 32b-parameter model with 4-bit quantization and a 40,000-token context entirely on the GPU at around 40 tokens/sec.
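If anyone wants to reproduce that setup, here's roughly what it looks like with llama-cpp-python. The GGUF filename is a placeholder, and the largest context that actually fits in 24 GB depends on the model's KV-cache footprint, so treat the numbers as a starting point:

    # Rough sketch: a ~4-bit 32b model plus a 40k context, all on the GPU.
    # The GGUF path is a placeholder; adjust n_ctx if the KV cache doesn't fit.
    from llama_cpp import Llama

    llm = Llama(
        model_path="qwen2.5-32b-instruct-q4_k_m.gguf",  # 4-bit quant, roughly 18-20 GB of weights
        n_gpu_layers=-1,  # offload every layer to the GPU, no spillover into system RAM
        n_ctx=40_000,     # context window; shrink this if you see out-of-memory errors
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Explain KV-cache memory use in one paragraph."}]
    )
    print(out["choices"][0]["message"]["content"])

Setting n_gpu_layers to -1 is what keeps everything on the card; the moment layers or the KV cache spill into system RAM, throughput drops well below that 40 tokens/sec.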