Spec v7 — Unified Research Paper
Marow Research New York, NY — [email protected]
April 2026
Large language models require substantial compute for inference, yet every existing quantization method applies fixed precision determined offline. This paper introduces the Quantization Blueprint Engine (QBE), a runtime system that analyzes each incoming query and generates a per-layer precision assignment — a "blueprint" — that jointly specifies weight bit-widths, activation precisions, and KV-cache quantization parameters for every layer of a transformer model. Unlike static methods (GPTQ, AWQ, HIGGS, SpinQuant) that fix precision at model preparation time, unlike whole-model routing (PAQ) that selects among pre-quantized variants, and unlike offline RL-based assignment (RAMP) that determines bit-widths once per model, the QBE adapts precision dynamically for each query based on semantic complexity analysis.
The system integrates compiler-level JIT kernel synthesis via Triton, hierarchical blueprint caching with semantic fingerprinting, hardware-native execution on FP4/FP6/FP8 tensor cores, speculative decoding with adaptive precision gap control, SLA-driven precision routing, intra-forward-pass precision cascading driven by attention entropy, mid-generation blueprint refresh based on output entropy monitoring, energy-aware blueprinting, and type-aware adaptive quantization for hybrid transformer-SSM architectures with temporal-drift-constrained state precision.
To the best of our knowledge, no prior work jointly combines all of the following at runtime: per-query semantic analysis, per-layer weight precision selection, per-layer KV-cache adaptation, per-layer activation adaptation, mid-generation precision refresh, and temporal-drift-constrained SSM state protection — operating together for each individual query during autoregressive LLM inference.
The growing scale of transformer-based language models has significantly improved their ability to generate coherent, context-aware text. Yet this scale comes at a high price: inference remains prohibitively expensive for real-time use on consumer hardware, edge devices, and multi-tenant cloud systems.
Quantization methods reduce inference cost by lowering the precision of operations. Common techniques apply fixed bit-widths (e.g., 8-bit, 4-bit) across the entire model, regardless of what the input query demands. While effective in reducing compute, uniform quantization ignores the fact that not all queries or layers require the same precision. A simple factual lookup does not need the same computational depth as multi-step mathematical reasoning.
This paper proposes Quantization Blueprinting: a method for adaptively selecting per-layer quantization levels based on the complexity and nature of each query. The system includes a lightweight pre-inference analysis phase that evaluates the input and outputs a blueprint — a layer-by-layer precision assignment that guides model execution for that specific query.
As of early 2026, the landscape of LLM quantization is extensive (Section 3 surveys 30+ methods). Every method falls into one of three categories:
No existing system combines:
The QBE operates at the intersection of all six dimensions simultaneously.
The QBE comprises five modular subsystems, each responsible for a specific function in the blueprinting process.
Converts raw query features into a normalized complexity score between 0 and 1 using a weighted scoring algorithm. Input features include token count, prompt classification (summarization, dialogue, Q&A, code generation), semantic entropy (distributional unpredictability of token embeddings), and target output structure.
Uses the complexity score to select a policy and generate a quantization blueprint — a map assigning bit-widths to each transformer layer. The scheduler consults the blueprint cache (Section 8) before computing new blueprints.
Identifies functional roles of each transformer layer (token normalization, context routing, MoE dispatch, output synthesis) and adjusts precision assignments accordingly. Extended with an SSM Layer Classifier (220d) for hybrid architectures (Section 12).
Applies the generated bit-width configuration to the model in-place using tensor core-compatible operations. Transitions between bit-widths are performed using gear-based logic that avoids model recompilation. The dispatcher coordinates with the weight precision cache and JIT-compiled kernels.
Collects runtime metrics (memory usage, output fidelity, throughput, latency) and logs them for policy refinement and blueprint caching. Feeds into the proxy scorer training pipeline (Section 14).
The QBE control plane runs on CPU. Model inference runs on GPU. This separation ensures that blueprint computation does not compete with tensor core throughput. The CPU-side analysis overlaps with early-stage GPU operations (embedding lookup, position encoding), hiding blueprint generation latency behind useful computation.
HIGGS (Malinovskii et al., 2024, NAACL 2025) establishes a linearity theorem proving a direct relationship between layer-wise L2 reconstruction error and model perplexity increase. This provides theoretical justification for per-layer approaches but assigns precision offline. Dynamic HIGGS extends to non-uniform per-layer levels under a compression constraint — the optimal non-uniform data-free quantization technique — but still assigns at quantization time, not per query.
GPTQ (Frantar et al., 2022) performs one-shot weight quantization using approximate second-order information. AWQ (Lin et al., 2024) identifies salient weight channels and applies per-channel scaling. SpinQuant (Liu et al., Meta, 2024) learns optimal rotation matrices for W4A4KV4 quantization. All are static.
QuIP# (Tseng et al., ICML 2024) achieves the first PTQ where 3-bit outscales 4-bit using randomized Hadamard transforms and E8 lattice codebooks. AQLM (Egiazarian et al., 2024) uses learnable 8D additive codebooks below 3 bits. SqueezeLLM (Kim et al., ICML 2024) splits weights into dense and sparse components. All static.
PAQ (Prompt-Adaptive Quantization) routes entire prompts to one of several pre-quantized model variants via a ModernBERT router. This is query-adaptive but operates at whole-model granularity. The QBE is strictly more granular: it assigns precision per-layer within a single model instance, enabling configurations such as 4-bit attention layers with 8-bit feed-forward layers in the same forward pass. PAQ also requires maintaining multiple complete model copies in memory.
RAMP (March 2026) uses reinforcement learning (off-policy SAC) to determine per-layer bit-width assignments. This is the closest academic prior art. However, RAMP determines bit-widths once per model during an offline optimization phase; the resulting assignment is fixed. The QBE produces a unique configuration for each query.
DP-LLM (arXiv:2508.06041, August 2025, NeurIPS 2025) augments each linear layer with a lightweight precision selector that determines bit-width at runtime based on per-token input values. This is the closest prior art to runtime dynamic precision. However, DP-LLM operates at per-token granularity with learned error estimators per layer — it does not perform pre-query semantic analysis, does not generate a holistic blueprint, does not cache blueprints for reuse, and does not jointly control KV-cache or activation precision.
FlexQuant (arXiv:2506.12024, EMNLP 2025) introduces token-granularity dynamic precision switching using perplexity entropy and KL divergence to adjust bit-widths during decoding. Like DP-LLM, FlexQuant adapts during generation but at a per-token/per-layer level without query-level semantic blueprinting, hierarchical caching, or SSM support.
CoopQ (September 2025) formalizes mixed-precision quantization as a cooperative game using Shapley values to capture inter-layer interactions. The assignment is offline. ScaleBITS (February 2026) performs hardware-aligned block-wise partitioning with automated precision allocation. Also offline.
QAQ (ICCV 2025) recalculates per-token KV-cache quantization bits during inference — the closest prior art to runtime adaptive KV behavior. However, it adapts KV-cache only, not weight precision.
Cocktail (March 2025) performs query-aware KV-cache quantization using chunk-similarity scoring. MoQAE (ACL 2025) uses MoE routing to select KV-cache precision per chunk at runtime. Both are KV-cache only.
KVQuant (NeurIPS 2024), KIVI (ICML 2024), and Coupled Quantization (NeurIPS 2024) all advance KV-cache compression but with fixed precision.
SmoothQuant (Xiao et al., ICML 2023) migrates quantization difficulty from activations to weights via per-channel scaling, enabling W8A8. ViDiT-Q (ICLR 2025) proves that input-dependent quantization is both feasible and beneficial for diffusion transformers — empirically validating the QBE's core principle for a different architecture.
Quamba (2024) applies static PTQ to Mamba-family models. Quamba2 (2025) adds per-state-group quantization for selective scan parameters. OuroMamba (2025) addresses dynamic outlier variations per time step in vision Mamba. None perform per-query adaptation, none profile temporal drift, and none integrate with a unified weight/activation/KV-cache blueprint.
vLLM, TensorRT-LLM, and SGLang are the dominant serving frameworks. None support per-query adaptive precision selection. They all use a single fixed quantization configuration for the model's lifetime.
Fireworks AI (~$130M ARR, custom FireAttention kernels), Neural Magic (sparsity + quantization, acquired by Red Hat), and Unsloth (dynamic parameter selection during offline quantization) represent the commercial frontier. None perform per-query runtime adaptation.
| Approach | Adaptive? | Granularity | When? | Per-Query? |
|---|---|---|---|---|
| HIGGS Dynamic | Per-layer bit-width | Per-layer | Offline | No |
| PAQ | Model routing | Whole model | Runtime | Yes, coarse |
| RAMP | RL-based | Per-layer | Offline | No |
| CoopQ | Shapley-based | Per-layer | Offline | No |
| QAQ | Token bits | Per-token KV | Runtime | Partially |
| Cocktail | Chunk bits | Per-chunk KV | Runtime | KV only |
| MoQAE | MoE router | Per-chunk KV | Runtime | KV only |
| ViDiT-Q | Timestep | Per-layer | Runtime | Analogous |
| QBE | Semantic query | Per-layer, joint | Runtime | Yes |
Per-query adaptive quantization introduces a fundamental tension: if the system maintains separate weight copies at each precision tier, aggregate VRAM may exceed savings. If it quantizes on-the-fly from full precision per query, quantization latency may be unacceptable.
The system maintains a single canonical copy of model weights in FP16 (or BF16). This is loaded once at initialization and never modified. Its memory footprint equals standard FP16 deployment.
A cache indexed by (layer_id, bit_width) stores pre-quantized weight tensors as torchao AffineQuantizedTensor instances.
For a 70B parameter model with 80 layers supporting 3 precision tiers:
After cache warm-up, the canonical store can optionally be offloaded to host memory, reducing GPU memory to only the cached quantized variants.
Quantizing a large weight matrix from FP16 on a cache miss is memory-bandwidth-bound and cannot be hidden behind a microsecond embedding lookup for large models. The system mitigates this through aggressive pre-warming: on model load, the cache is populated with the 2-3 most common precision tiers across all layers using representative traffic profiles. In steady-state operation, cache misses are rare (< 1% of queries) and occur only for unusual precision tiers. When a miss does occur, the system falls back to the nearest cached precision tier rather than blocking on quantization, logging the miss for background cache population.
If the complexity analysis itself is expensive, it negates the inference savings. The QBE provides a tiered approach:
Simple threshold-based classification. Queries under N tokens get aggressive quantization; queries over M tokens get conservative quantization. Zero ML overhead. Suitable for latency-critical deployments.
A small classification model (logistic regression or shallow MLP) trained on prompt features: token count, vocabulary diversity, punctuation density, question markers. Runs on CPU concurrently with GPU embedding computation.
A distilled encoder that produces a semantic complexity vector from the input. The vector is compared against blueprint cluster centroids for nearest-neighbor lookup. This tier captures nuances (e.g., a short prompt with complex mathematical notation) that token-level features miss.
Deployments choose their complexity tier based on latency budget. Higher tiers produce better blueprints but cost more analysis time. The system defaults to Tier 2 with Tier 1 fast-path for queries under 32 tokens.
KV-cache memory scales linearly with sequence length and dominates VRAM in long-context inference. Existing methods (KVQuant, KIVI, QAQ) apply fixed precision. The QBE adapts KV-cache precision per layer per query based on context length and attention characteristics.
Within a single layer, different attention heads may receive different KV-cache precision based on attention entropy. Heads with diffuse attention (high entropy, model uncertainty) receive higher precision. Heads with focused attention (low entropy, confident retrieval) tolerate lower precision.
The system integrates with paged memory managers (vLLM). Quantization metadata is stored per page block. Different requests in a continuous batch maintain KV-caches at different precisions without batching penalty.
The blueprint schema extends beyond weight-only precision to include per-layer activation precision. A single blueprint entry specifies: (weight_bits, activation_bits, kv_bits) for each layer.
Following SmoothQuant, activation outliers are migrated to weights via per-channel scaling before quantization. The QBE can dynamically adjust the smoothing factor based on the query's expected activation distribution.
Recent work (Atom, QServe) introduces dynamic activation quantization with hardware-native block-wise scaling on Hopper/Blackwell. These methods adapt activation precision based on outlier statistics but apply the same strategy to all queries. The QBE's activation adaptation is query-driven: the blueprint specifies per-layer activation precision based on the semantic complexity of the input, complementing (not replacing) hardware-native scaling within each precision tier.
For high-entropy tokens (identified via attention patterns or logit distributions from preceding layers), the system assigns higher activation precision. Low-entropy tokens receive aggressive quantization. A mixed-precision kernel processes different token groups at their assigned precisions within a single kernel launch.
Blueprints are indexed by a semantic query fingerprint — a compact, fixed-length hash computed via locality-sensitive hashing (LSH) over the complexity feature vector. Similar queries map to similar fingerprints, enabling approximate cache hits.
On model load, the system pre-populates caches using a representative query set from production traffic logs. This eliminates cold-start latency for common query patterns.
A gossip protocol federates blueprint caches across nodes. When a node computes a new blueprint, it propagates the entry to peer nodes with configurable fan-out. This ensures optimized blueprints discovered on one node are available fleet-wide within seconds.
For each unique (bit_width, tensor_shape, hardware_target) triple, the system generates a Triton kernel specification at first encounter, compiles it to hardware-specific GPU binary, and caches the result. Subsequent encounters dispatch the cached kernel directly.
The QBE integrates with PyTorch's torch.compile via a custom backend that injects blueprint-aware precision annotations into the compiled computation graph. This enables the PyTorch compiler to fuse blueprint-specified precision transitions with surrounding operations.
Key fusion opportunities:
Triton JIT autotuning can spike 50-500ms on cache miss. For short prompts (< 32 tokens, ~30ms evaluation), this latency cannot be fully hidden. The system mitigates this through: (a) pre-compilation of common kernel variants during model warm-up, (b) speculative prefetch where blueprint lookups and kernel compilation overlap with early-stage GPU operations for longer prompts, and (c) fallback to pre-compiled generic kernels when JIT compilation would exceed the latency budget. In practice, the kernel cache reaches > 95% hit rate within the first hour of production traffic.
The QBE maps abstract precision tiers to hardware-native formats:
| Blueprint Tier | Blackwell | Hopper | Ampere |
|---|---|---|---|
| Ultra-low | NVFP4 | INT4 (tinygemm) | INT4 (tinygemm) |
| Low | FP6 | INT4 (CUTLASS) | INT4 (CUTLASS) |
| Medium | FP8 E4M3 | FP8 E4M3 | INT8 |
| High | FP16 | FP16 | FP16 |
At initialization, the hardware capability profiler detects available tensor core operations. The constraint adjustment module restricts the blueprint precision search space to formats natively supported, ensuring blueprints generated for one hardware generation are automatically remapped on different hardware.
Speculative decoding uses a small draft model to generate candidate tokens, verified in parallel by the full target model. Combined with adaptive quantization:
The system monitors rolling acceptance rate of draft tokens over a window of at least 16 tokens:
This adaptive loop maximizes the throughput benefit of speculative decoding while maintaining quality.
Recent production models (NVIDIA Nemotron, AI21 Jamba, Zamba) interleave transformer attention layers with state space model (SSM) layers. These hybrid architectures present fundamentally greater layer heterogeneity: attention layers, SSM recurrent state layers, SSM gating layers, and feed-forward layers coexist with different mathematical properties and different quantization sensitivities.
The critical distinction: attention layers are stateless within a forward pass (each computation is independent given the KV-cache), whereas SSM state layers are inherently sequential. The state from token t is composed into the state for token t+1 via a recurrence relation. Quantization error at token t propagates through all subsequent state updates, accumulating over sequence length.
The Nemotron 3 Super technical report (2026) confirms this: "quantization error in the Mamba cache does not remain local but propagates through the recurrence relation." Their solution is hand-tuned, fixed layer-type precisions. No prior work combines per-query adaptive quantization with systematic temporal drift profiling.
The QBE-LayerTagger is extended with an SSM Layer Classifier (220d) that identifies four SSM-specific layer types by inspecting the model computation graph:
During calibration, the QBE executes a temporal drift profiling protocol for each SSM state layer:
For ultra-long-context models where full-length calibration is prohibitive, the protocol extrapolates drift rate from shorter runs with a safety margin (default: 1.5×).
The JIT pipeline generates a fused dequantize-state-update-requantize kernel for SSM blocks. This kernel:
The recurrent hidden state tensor is maintained at blueprint precision throughout, in on-chip shared memory or registers. Only requantized output activations are materialized to global memory. This ensures the SSM state is never exposed to quantization noise from adjacent layers.
The QBE exposes quantization as a user-facing quality/cost control. Via an API endpoint, callers specify a per-request precision budget:
The system translates the budget into blueprint constraints that bound the precision search space, then selects a blueprint satisfying those constraints.
A single model instance simultaneously serves requests at different precision levels. An economy-tier request may receive aggressive INT4 quantization across most layers, while a premium-tier request receives FP8 or FP16. This enables differentiated API pricing from a unified deployment.
Token costs are computed based on the actual precision used, not a flat rate. Callers who accept lower precision pay less per token. This aligns economic incentives with compute efficiency.
Blueprint selection as described above is a single decision made before the forward pass. However, autoregressive generation produces many tokens, and difficulty can change substantially during a response. A query that begins with factual recall may transition to complex reasoning.
During generation, the system monitors output token entropy at configurable intervals (default: every 32 tokens):
A sustained entropy transition (crossing threshold for N consecutive intervals, default N=3) triggers blueprint refresh. Transient spikes do not trigger refresh, preventing precision oscillation.
On trigger, the system computes an updated complexity profile incorporating both original query features and observed entropy trajectory, selects a new blueprint, and applies it at the next layer boundary. KV-cache entries already stored at original precision remain unchanged; new entries use updated precision. The paged attention kernel handles mixed-precision pages within a single sequence.
This mechanism makes the patent substantially harder to design around. Any system that adapts precision at query time but not during generation leaves performance on the table for long-form outputs.
In continuous batching, multiple requests share GPU compute. If one request requires high precision, naive implementations force the entire batch to max precision, negating QBE benefits.
The system maintains separate pending request queues for each precision tier. Batches are assembled preferentially from single-tier queues. When a queue has insufficient requests within a latency deadline, adjacent tiers are merged at the higher tier.
For mixed-precision batches, the system splits the batch at the kernel level. Each sub-batch is dispatched to the appropriate precision kernel. This reduces arithmetic intensity compared to uniform-precision batching — reading different weight variants increases memory bandwidth pressure. The system mitigates this by coalescing requests into at most 2-3 precision tiers (not arbitrary per-request precision), ensuring sub-batches remain large enough for efficient GEMM utilization. In practice, production traffic clusters naturally into a small number of complexity bands.
KV-cache pages store per-page quantization metadata. Different requests in the same batch maintain KV-caches at different precisions without mutual interference.
For MoE models (Mixtral, DeepSeek-V3), router layers maintain 8-bit minimum to preserve routing accuracy. Per-expert layers receive variable quantization based on activation frequency: experts activated more than 10% of the time receive moderate quantization; rarely activated experts (< 1%) receive aggressive INT4/FP4.
Separate precision profiles for vision encoder and language decoder. Vision encoders typically require FP8/INT8 minimum due to spatial feature sensitivity.
For GQA/MQA models, KV-cache precision scales with the query-to-KV head ratio. Fewer KV heads means higher precision per head, as errors are amplified proportionally.
The blueprint schema accepts power budget constraints. Under thermal throttling or power caps, the system biases toward lower precision to reduce energy per token.
When integrated with grid carbon intensity APIs, the system can shift precision dynamically: more aggressive quantization during high-carbon periods, higher precision during low-carbon windows. This enables carbon-aware inference without changing the API contract.
The full blueprint selection pipeline (complexity analysis → scorer → cache lookup → blueprint generation) has non-trivial latency at Tier 3. A proxy scorer model learns to predict the optimal blueprint directly from raw query features, bypassing the full pipeline for common patterns.
A small neural network (< 1M parameters) trained on (query_features → blueprint) pairs collected from production traffic. The proxy scorer runs on CPU in < 50μs and handles 80-90% of queries. Complex or novel queries fall through to the full pipeline.
The proxy scorer is retrained periodically on accumulated (query, blueprint, quality_metric) triples. Over time, its coverage increases and the full pipeline is invoked less frequently.
The QBE builds on PyTorch's torchao library for all quantization primitives. The quantize_(model, config) API is extended with a per-query config selector that the QBE controls.
Quantized weight tensors are stored as torchao AffineQuantizedTensor instances, carrying quantized data, scale factors, zero points, and metadata. The cache interacts natively with torch.compile's optimization passes.
torchao's backend selection (tinygemm for INT4, CUTLASS for INT8, native Tensor Core for FP8/FP4) is driven by the blueprint's per-layer precision specification and the hardware capability profile.
Performance is evaluated across three axes:
Based on the literature and the QBE's design:
| Metric | Conservative | Optimistic |
|---|---|---|
| Perplexity degradation | < 0.5% | < 0.1% |
| Throughput improvement | 1.5-2x | 3-4x |
| VRAM reduction | 25-40% | 50-65% |
| Energy reduction | 25-40% per query | 40-60% |
For large-scale inference providers processing billions of queries daily:
A method for per-query adaptive quantization of a large language model during autoregressive inference, comprising query semantic analysis, per-layer blueprint generation, and blueprint-guided execution where different queries receive different precision assignments.
The method above where the blueprint jointly specifies per-layer weight precision, KV-cache precision, and activation precision.
A method for mid-generation blueprint refresh based on sustained output entropy transitions, enabling intra-generation precision adaptation without interrupting inference.
A method for joint speculative decoding and adaptive quantization with dynamic precision gap control based on rolling draft-token acceptance rate.
SLA-driven precision routing where a single model instance serves requests at different precision levels based on per-request budget parameters.
Intra-forward-pass precision cascading driven by attention entropy, adjusting subsequent layer precision within the same forward pass.
A system for hardware-native multi-format execution with automatic precision tier remapping across GPU generations.
A method for dynamic weight re-quantization with a precision cache using torchao AffineQuantizedTensor instances, resolving the storage paradox.
A method for adaptive quantization of hybrid transformer-SSM models with temporal-drift-constrained precision flooring for recurrent state layers, type-aware layer classification, and fused dequantize-state-update-requantize kernels.
Hierarchical blueprint caching with semantic fingerprinting, multi-tier storage, and gossip protocol federation.
A method for token-difficulty-aware activation precision scheduling using mixed-precision kernels.
A proxy scorer model for accelerated blueprint prediction trained on production traffic.
Energy and carbon-aware blueprinting with grid carbon intensity integration.
Quantization Blueprinting represents a shift in how LLMs are executed: not as monolithic compute blocks, but as flexible, task-aware systems. By aligning model precision with query complexity at runtime — and adapting that alignment during generation — the QBE opens a new dimension of optimization that is orthogonal to every existing quantization method.
The system is designed for production deployment: it integrates with existing frameworks (vLLM, torchao, Triton), runs on current hardware (Blackwell, Hopper, Ampere), and provides economic incentives through differentiated pricing. The multi-architecture support (transformers, MoE, hybrid SSM, vision-language) ensures applicability across the rapidly evolving model landscape.
Every existing quantization method determines precision offline or adapts only a single dimension at runtime. The QBE jointly adapts weights, activations, and KV-cache precision per layer per query — and adapts that assignment mid-generation based on output entropy.
Contact: [email protected] https://marow.ai