We've developed extreme quantization techniques that retain high accuracy, including the world's first usable pure 1-bit LLM.
Our kernels run Llama-3-8B at 200 tokens/sec on consumer GPUs.
Our optimizations let you run 40B-parameter LLMs on consumer gaming GPUs, drastically reducing cost (e.g., Llama-3-70B on an A6000 for $0.50/hour).
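To give a feel for what 1-bit quantization means, here is a minimal sketch of a generic sign-based scheme: each weight is stored as a single sign bit plus one shared per-tensor scale (the mean absolute value). This is illustrative only; the function names are placeholders and the project's actual quantizer may use a different scale or grouping.

```python
import numpy as np

def quantize_1bit(w: np.ndarray):
    """Binarize a weight tensor to {-1, +1} with a per-tensor scale.

    Illustrative sign-based scheme (scale = mean |w|); the project's
    actual quantizer may differ.
    """
    scale = float(np.abs(w).mean())   # one scaling factor for the tensor
    signs = np.where(w >= 0, 1, -1)   # the 1-bit codes
    return signs.astype(np.int8), scale

def dequantize_1bit(signs: np.ndarray, scale: float) -> np.ndarray:
    """Reconstruct an approximate weight tensor from signs and scale."""
    return signs.astype(np.float32) * scale

# Roughly 16x storage reduction vs FP16: one bit per weight plus one scale.
w = np.random.randn(4, 4).astype(np.float32)
signs, scale = quantize_1bit(w)
w_hat = dequantize_1bit(signs, scale)
```

The payoff is memory: at one bit per weight, a model that needs ~16 GB in FP16 fits in roughly 1 GB of weight storage, which is what makes large models feasible on consumer GPUs.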