int4.llm

70B-class LLMs.
One $400 GPU.

Adaptive Bit Mixed quantization + Zero-Offload PCIe scheduling — running entirely on your own machine. Prompts and weights never leave the box.

Free tier at launch: 50M tokens / month, all model sizes (8B / 32B / 70B).

Quantization: Adaptive Bit Mixed

A 7-tier per-layer dispatcher (FP16 / INT8 / INT6 / INT4 / NF4 / INT3 / INT2) picks a precision for each layer. Hadamard preconditioning is on by default, and body layers use an NF4 codebook. A 70B model compresses to 44 GB at an average of 4.5 bits/param. Quality retention is 95.7% against FP16 perplexity, measured on Llama-3.1-8B.
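To make the "bits/param avg" figure concrete, here is a minimal sketch of how an average bit width falls out of a per-layer tier assignment. The layer names, parameter counts, and tier choices are hypothetical, chosen only for illustration; they are not the dispatcher's actual output, and the sketch ignores per-group scales and codebook storage.

```python
# Bit width of each tier named on this page.
TIER_BITS = {
    "FP16": 16, "INT8": 8, "INT6": 6, "INT4": 4,
    "NF4": 4, "INT3": 3, "INT2": 2,
}

def average_bits(layers):
    """Parameter-weighted average bit width.

    `layers` is a list of (name, param_count, tier) tuples.
    """
    total_params = sum(count for _, count, _ in layers)
    total_bits = sum(count * TIER_BITS[tier] for _, count, tier in layers)
    return total_bits / total_params

# Hypothetical assignment: sensitive layers stay wide, body layers use NF4.
example_layers = [
    ("embeddings", 10, "FP16"),
    ("body", 80, "NF4"),
    ("low_sensitivity", 10, "INT2"),
]
avg = average_bits(example_layers)
```

Because the average is weighted by parameter count, a small number of FP16 layers moves it far less than the tier chosen for the large body layers.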

Throughput: Zero-Offload PCIe

A 70B model doesn't fit in 16 GB of VRAM. The Phase B Zero-Offload scheduler pre-allocates pinned host RAM and overlaps weight prefetch with compute on CUDA streams. The target is 8-10 tok/s: chat-grade, on a single $400 GPU.
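The overlap of weight prefetch with compute can be pictured as a double-buffered loop: while layer i runs, the weights for layer i+1 are already being copied. The sketch below is a CPU-only stand-in that uses a worker thread in place of a CUDA copy stream. `load_layer` and `compute` are hypothetical placeholders, not the scheduler's real API, and nothing here touches pinned memory or the GPU.

```python
from concurrent.futures import ThreadPoolExecutor

def load_layer(i):
    """Hypothetical stand-in for a host-to-device weight copy."""
    return i + 1  # pretend layer i's "weights" are the scalar i + 1

def compute(x, w):
    """Hypothetical stand-in for running one layer."""
    return x * w + 1

def run_overlapped(x, n_layers):
    # A single worker thread plays the role of a dedicated copy stream.
    with ThreadPoolExecutor(max_workers=1) as copier:
        pending = copier.submit(load_layer, 0)
        for i in range(n_layers):
            w = pending.result()  # wait until layer i's weights have arrived
            if i + 1 < n_layers:
                # Start fetching the next layer before computing this one.
                pending = copier.submit(load_layer, i + 1)
            x = compute(x, w)  # overlaps with the copy submitted above
    return x

def run_sequential(x, n_layers):
    # Reference version: copy, then compute, with no overlap.
    for i in range(n_layers):
        x = compute(x, load_layer(i))
    return x

result = run_overlapped(1, 5)
```

Both versions produce the same output; the overlapped one only changes when the copies are issued, which is where any speedup in a real pipeline would come from.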

Privacy: Truly local

Inference runs entirely on the customer's machine. The SaaS control plane only sees license validity and aggregate hardware telemetry. Prompt text and completions never transit the network.

© 2026 int4.llm — Ultra-Eco LLM Engine

Live performance

Updated 2026-06-20

Quality retention
95.7% vs FP16 PPL
Llama-3.1-8B / NVIDIA A100-SXM4-40GB
Inference speed
1,140 tok/s
Llama-3.2-1B / Tesla T4
Power ratio
1 / 1.8 vs A6000
165 W RTX 4060 Ti vs 300 W

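"Quality retention" and "Inference speed" are benchmarks measured on real hardware. "Power ratio" is the TDP ratio for the assumed deployment targets; because energy is W × t, it equals the energy ratio for the same work only when both devices run at the same speed. Details → Scaling rationale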
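The caveat on the power ratio can be worked through numerically. The sketch below uses the two TDP figures quoted above (165 W and 300 W); the speed values are illustrative assumptions, not measurements, and `energy_ratio` is a hypothetical helper.

```python
def energy_ratio(power_a_w, power_b_w, speed_a, speed_b):
    """Energy used by device A divided by energy used by device B
    for the same amount of work.

    Energy = power x time, and time for a fixed job is inversely
    proportional to speed, so the ratio is
    (P_a / P_b) * (speed_b / speed_a).
    """
    return (power_a_w / power_b_w) * (speed_b / speed_a)

# Equal speeds: the energy ratio reduces to the TDP ratio, 165 / 300.
equal_speed = energy_ratio(165, 300, speed_a=1.0, speed_b=1.0)

# Assumed case where the 165 W card runs at half the speed of the 300 W card.
half_speed = energy_ratio(165, 300, speed_a=0.5, speed_b=1.0)
```

At equal speed the ratio is 0.55, which is the 1 / 1.8 shown in the panel. In the assumed half-speed case it rises to 1.1, so the lower-power card would use slightly more energy for the same job.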