

Edit: also, I have a very strong suspicion that someone will figure out a way to make most matrix multiplications in an LLM sparse, doing mostly the same shit in a different basis. An answer to a specific query does not intrinsically use every piece of information the LLM has memorized.
Like MoE (Mixture of Experts) models? This technique is already in use by many models: DeepSeek, Llama 4, Kimi K2, Mixtral, Qwen3 30B and 235B, and many more. I read that GPT-4 was leaked and confirmed to use MoE, and Grok is confirmed to use MoE; I suspect most large, hosted, proprietary models use MoE in some manner.
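To make the sparsity concrete, here is a minimal numpy sketch of the top-k routing idea behind MoE layers: a small router scores the experts for each token, and only the top-scoring few actually run. All the names and sizes here (`n_experts`, `d_model`, `top_k`, `gate_w`, `experts`) are illustrative toy values, not taken from any real model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy MoE layer: 8 experts, each token routed to its top-2 experts.
# Sizes are illustrative only.
n_experts, d_model, top_k = 8, 16, 2
gate_w = rng.normal(size=(d_model, n_experts))            # router weights
experts = rng.normal(size=(n_experts, d_model, d_model))  # one weight matrix per expert

def moe_forward(x):
    """x: (d_model,) -> (d_model,). Only top_k experts are evaluated."""
    logits = x @ gate_w                        # router score for each expert
    chosen = np.argsort(logits)[-top_k:]       # indices of the top_k experts
    weights = np.exp(logits[chosen])
    weights /= weights.sum()                   # softmax over the chosen experts
    # Sparse compute: the other n_experts - top_k experts are skipped entirely,
    # so per-token FLOPs scale with top_k, not with n_experts.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

out = moe_forward(rng.normal(size=d_model))
print(out.shape)  # (16,)
```

The point of the sketch is the last comment: adding experts grows the parameter count (memorized information) without growing per-token compute, since each query only touches a few experts.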
We need more unions like this one.