70B-class LLMs.
One $400 GPU.
Adaptive Bit Mixed quantization + Zero-Offload PCIe scheduling — running entirely on your own machine. Prompts and weights never leave the box.
Free tier at launch: 50M tokens / month, all model sizes (8B / 32B / 70B).
7-tier per-layer dispatcher (FP16 / INT8 / INT6 / INT4 / NF4 / INT3 / INT2). Hadamard preconditioning default-on, NF4 codebook for body layers. 70B compresses to 44 GB at 4.5 bits/param avg, 95.7 % quality retention measured on 8B.
70B doesn't fit in 16 GB VRAM. Phase B Zero-Offload scheduler pre-allocates pinned host RAM and overlaps weight prefetch with compute on CUDA streams. Target 8-10 tok/s — chat-grade, on a single $400 GPU.
Inference runs entirely on the customer's machine. The SaaS control plane only sees license validity and aggregate hardware telemetry. Prompt text and completions never transit the network.