Understanding Speculative Decoding: A Deep Dive into Faster LLM Inference
MantraVid Admin
April 15, 2026
Introduction
Large Language Models (LLMs) have revolutionized artificial intelligence, but they come with a significant practical challenge: slow inference. When you ask a model to generate text, it must produce one token at a time, requiring a full forward pass through the model for each token. For a typical 7B-parameter model, this means streaming roughly 14 GB of weights (at 16-bit precision) from memory for every single token generated.
This isn't just an academic problem—it directly impacts:
User experience in chatbots and APIs
Energy costs and carbon footprint
Infrastructure requirements for deployment
Real-time application feasibility
In 2022, Google Research introduced speculative decoding, a technique that can reduce inference times by 2-4x without compromising output quality. This blog post explores how it works, why it matters, and how you can use it today.
The Core Problem: Sequential Token Generation
To understand speculative decoding, we first need to understand why LLM inference is so slow.
How LLMs Generate Text
An LLM doesn't output text directly. Instead, it:
Takes a sequence of input tokens
Computes a probability distribution over the next token
Samples one token from this distribution
Appends it to the sequence and repeats
For example, to generate the sentence:
"One small step for man, one giant leap for mankind"
The model must:
Run roughly a dozen forward passes (about one per token)
Each forward pass streams the entire model weights (~14GB for a 7B model in FP16)
GPU memory bandwidth is typically ~1000 GB/s
GPU arithmetic throughput far exceeds what memory can feed it
This creates a memory bandwidth bottleneck: the GPU sits idle waiting for data, not for computation.
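The generation loop just described can be sketched in a few lines of Python. The "model" below is a toy lookup table standing in for a real network (`BIGRAMS`, `next_token_distribution`, and `generate` are illustrative names, not any library's API):

```python
import random

random.seed(0)

# Toy "model": maps the last token to a next-token distribution.
# A real LLM computes this with a full forward pass over all weights.
BIGRAMS = {
    "The": {"quick": 0.7, "slow": 0.3},
    "quick": {"brown": 0.9, "red": 0.1},
    "brown": {"fox": 1.0},
    "fox": {"jumps": 0.8, "runs": 0.2},
}

def next_token_distribution(tokens):
    return BIGRAMS.get(tokens[-1], {"<eos>": 1.0})

def generate(prompt, max_tokens=4):
    tokens = list(prompt)
    for _ in range(max_tokens):                    # one "forward pass" per token
        dist = next_token_distribution(tokens)     # probability distribution
        choices, weights = zip(*dist.items())
        tok = random.choices(choices, weights)[0]  # sample one token
        if tok == "<eos>":
            break
        tokens.append(tok)                         # append and repeat
    return tokens

print(generate(["The"]))
```

Each iteration depends on the previous one's output, which is exactly why the real loop cannot be parallelized.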
Why This Matters
┌───────────────────────────────────────┐
│ Standard Decoding Pipeline │
│ │
│ Input: [The quick brown fox] │
│ ↓ │
│ Forward pass #1 → "jumps" (0.73 prob)│
│ ↓ │
│ Forward pass #2 → "over" (0.81 prob) │
│ ↓ │
│ Forward pass #3 → "the" (0.65 prob) │
│ ↓ │
│ ... continues sequentially ... │
└───────────────────────────────────────┘
Every forward pass requires:
Loading all model weights into memory
Computing the attention mechanism
Computing the feed-forward network
Sampling the next token
The sequential nature means we can't run this process in parallel.
Speculative Decoding: The Core Idea
Speculative decoding is inspired by speculative execution from computer architecture. In CPUs, speculative execution predicts what instructions will be needed next and executes them before they're actually required, then verifies the prediction.
The Analogy
┌─────────────────────────────────────────────────────────────┐
│ Speculative Execution (CPU) │
│ ──────────────────────────────────────────────────────── │
│ │
│ CPU predicts: "Next instruction will be ADD" │
│ ↓ │
│ Execute ADD in parallel with branch prediction │
│ ↓ │
│ If prediction correct → Accept result │
│ If prediction wrong → Discard and retry │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ Speculative Decoding (LLMs) │
│ ───────────────────────────────────────────────────────── │
│ │
│ Draft model predicts: "Next token is 'jumps'" │
│ ↓ │
│ Verify with large model in parallel │
│ ↓ │
│ If prediction correct → Accept and continue │
│ If prediction wrong → Reject and use large model │
└─────────────────────────────────────────────────────────────┘
Key Observations
Observation 1: Not all tokens are equally hard to predict
Some tokens follow obvious patterns (e.g., completing "the United States of" with "America")
Small models can predict these well
Large models excel at difficult tokens
Observation 2: Memory bandwidth is the bottleneck, not compute
GPUs can perform hundreds of operations per byte read
Transformer inference performs only a few operations per byte read
We have spare computational resources during inference
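A back-of-the-envelope calculation (using the illustrative numbers above: 7B parameters in FP16, ~1000 GB/s of memory bandwidth) makes the bottleneck concrete:

```python
# Illustrative numbers, not a benchmark of any specific GPU.
params = 7e9                  # 7B-parameter model
weight_bytes = params * 2     # 2 bytes/param in FP16 -> ~14 GB
bandwidth = 1000e9            # ~1000 GB/s memory bandwidth

# Each decoded token must stream every weight through memory once,
# so bandwidth alone caps single-stream decoding speed:
ceiling = bandwidth / weight_bytes
print(f"bandwidth-bound ceiling: ~{ceiling:.0f} tokens/sec")
```

That ceiling of roughly 70 tokens per second holds no matter how fast the arithmetic units are; speculative decoding raises effective throughput by amortizing each weight read over several tokens.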
The Algorithm
Speculative decoding uses a smaller "draft" model to:
Generate multiple tokens in parallel
Verify them with the large model in a single forward pass
Accept tokens that match the distribution
Reject tokens that don't match
┌─────────────────────────────────────────────────────────────┐
│ Speculative Decoding Algorithm │
│ ───────────────────────────────────────────────────────── │
│ │
│ Input: Context, Draft Model, Target Model, K=10 │
│ Output: Generated text with 2-4x speedup │
│ │
│ 1. Draft model generates K tokens in parallel │
│ - Context → Draft model → K candidate tokens │
│ │
│ 2. Target model verifies all K tokens in one forward pass │
│ - Context + K candidates → Accept/Reject decision │
│ │
│  3. Accept the longest prefix of agreeing tokens            │
│  4. On the first rejection → discard the rest and sample a  │
│     corrected token from the adjusted target distribution   │
│ │
│ 5. Repeat until completion │
└─────────────────────────────────────────────────────────────┘
Why It Works
Parallel verification: The large model can verify multiple tokens in one forward pass
Memory efficiency: We avoid redundant forward passes
Exact sampling: The output distribution remains identical to standard decoding
No retraining: Works with any pre-trained model
The Mathematical Guarantee
Speculative decoding guarantees that the output follows the same probability distribution as standard decoding:
P(speculative) = P(standard)
This is achieved through:
1. Accepting each draft token with probability min(1, p(x)/q(x)), where p is the target distribution and q the draft distribution
2. On rejection, resampling a corrected token from the normalized residual distribution max(0, p(x) − q(x))
Practical Implementation
What You Need
To implement speculative decoding, you need:
A draft model (small, 10-50% of target size)
A target model (your production model)
A way to generate multiple candidate tokens
A verification mechanism
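The verification mechanism is speculative (rejection) sampling: accept a drafted token x with probability min(1, p(x)/q(x)), where p is the target's distribution and q the draft's, and on rejection resample from the normalized residual max(0, p − q). A minimal sketch over hand-written toy distributions (not a real model):

```python
import random

random.seed(0)

def accept_or_resample(p, q, x):
    """One speculative-sampling step: p = target distribution,
    q = draft distribution, x = token proposed by the draft.
    Returns (token, was_draft_accepted)."""
    if random.random() < min(1.0, p[x] / q[x]):
        return x, True                                 # keep the draft token
    # Reject: resample from the residual max(0, p - q), renormalized.
    residual = {t: max(0.0, p[t] - q[t]) for t in p}
    toks, weights = zip(*residual.items())
    return random.choices(toks, weights)[0], False

p = {"jumps": 0.6, "runs": 0.3, "sits": 0.1}  # target model's distribution
q = {"jumps": 0.8, "runs": 0.1, "sits": 0.1}  # draft model's distribution
token, accepted = accept_or_resample(p, q, "jumps")
```

Averaged over draft proposals, the emitted token is distributed exactly according to p, which is what makes the output distribution identical to standard decoding.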
Draft Model Selection
The draft model should:
Be fast enough to generate tokens quickly
Have good predictive accuracy
Use less memory than the target model
Work well with the target model's vocabulary
Recommended sizes:
| Target Model | Draft Model | Speedup |
|---|---|---|
| 7B | 1B-3B | 2-3x |
| 13B | 3B-7B | 2-3x |
| 70B | 13B-21B | 2-3x |
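The speedups in the table follow from the acceptance-rate arithmetic in Leviathan et al. (2023): with per-token acceptance rate α and k drafted tokens, each expensive target pass yields on average (1 − α^(k+1))/(1 − α) tokens. The α values below are illustrative assumptions, not measurements:

```python
def expected_tokens_per_target_pass(alpha: float, k: int) -> float:
    """Expected tokens generated per target forward pass, given
    per-token acceptance rate alpha (< 1) and k drafted tokens;
    the +1 is the token the target itself supplies each pass."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.6, 0.8, 0.9):
    print(f"alpha={alpha}: {expected_tokens_per_target_pass(alpha, 5):.2f} tokens/pass")
```

At α = 0.8 and k = 5, each target pass produces ~3.7 tokens, which lands in the 2-3x range once draft-model overhead is subtracted.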
Implementation with llama.cpp
llama.cpp ships speculative decoding as a dedicated example program (exact flag names vary between versions; check --help on your build):
./llama-speculative -m models/llama-7b.gguf \
  -md models/draft-1b.gguf \
  --draft 10
Parameters:
-m: Path to the target model
-md (--model-draft): Path to the draft model
--draft: Number of tokens to speculate per round
Implementation with HuggingFace
Hugging Face Transformers supports speculative decoding as "assisted generation": pass the draft model as assistant_model to generate(). Here "draft-model" and "target-model" are placeholders for your own checkpoints, which must share a tokenizer:
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load models
draft_model = AutoModelForCausalLM.from_pretrained("draft-model")
target_model = AutoModelForCausalLM.from_pretrained("target-model")
tokenizer = AutoTokenizer.from_pretrained("target-model")
# Generate with the draft model proposing candidate tokens
prompt = "Write a poem about the ocean"
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = target_model.generate(
    **inputs,
    assistant_model=draft_model,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
Performance Comparison
| Method | Speedup | Memory Usage | Quality |
|---|---|---|---|
| Standard decoding | 1x | Baseline | 100% |
| Speculative decoding | 2-4x | +10-50% (draft) | 100% |
| Speculative + KV quantization | 4-8x | -75-90% (KV cache) | 99% |
| Speculative + Flash Attention | 4-6x | Baseline | 100% |
Research Developments
Since the original paper, many improvements have been made:
Distributed Speculative Decoding
Instead of one draft model, run several draft models in parallel across devices:
┌─────────────────────────────────────────────────────────────┐
│ Distributed Speculative Decoding │
│ ───────────────────────────────────────────────────────────│
│ │
│ Device 1: Draft Model A → Generate tokens │
│ Device 2: Draft Model B → Generate tokens │
│ Device 3: Target Model → Verify all candidates │
│ ↓ │
│   Target keeps the best-matching candidate sequence         │
└─────────────────────────────────────────────────────────────┘
Benefits:
Higher acceptance rates from multiple independent candidates
Load balancing across devices
Better utilization of compute resources
Multiple Draft Guesses
arXiv:2310.15141 - Use multiple draft models instead of one
Use several smaller draft models to generate candidate tokens:
draft_models = [
SmallModel("1B"),
SmallModel("2B"),
SmallModel("3B"),
]
# Each generates candidates
candidates = [model.generate(context) for model in draft_models]
# Combine and verify
combined = union(candidates)
target.verify(combined)
Model Distillation for Draft Models
arXiv:2310.08461 (DistillSpec) - Distill target-model knowledge into the draft
Train the draft model to mimic the target model's behavior:
# Knowledge distillation
teacher = LargeModel("target")
student = SmallModel("draft")
# Distill knowledge
student.train(
teacher=teacher,
loss=KL_divergence,
data=train_corpus
)
# Now draft model predicts well
Benefits:
Draft model predicts more accurately
Higher acceptance rates
Better overall speedup
Single Model Approach
arXiv:2401.10774 (Medusa) - One model for both draft and target
Use a single model that can act as both draft and target, switching based on context:
class UnifiedModel:
def __init__(self, model):
self.model = model
self.mode = None
def generate(self, context, mode="draft"):
if mode == "draft":
# Quick predictions
return self.model.fast_predict(context)
else:
# Accurate predictions
return self.model.slow_predict(context)
Verify All Draft Tokens Together
arXiv:2403.10444 - Joint verification approach
Verify multiple draft tokens in a single forward pass:
def verify_batch(context_ids, draft_ids, draft_probs):
    # Pseudocode: one forward pass scores every draft position at once
    target_probs = softmax(model.forward(context_ids + draft_ids))
    accepted = []
    for i, tok in enumerate(draft_ids):
        p = target_probs[len(context_ids) + i - 1][tok]
        q = draft_probs[i][tok]
        # Accept with probability min(1, p/q); stop at the first rejection
        if random.random() < min(1.0, p / q):
            accepted.append(tok)
        else:
            break
    return accepted
Domain Applications
Speculative decoding has been applied beyond text generation:
Image Generation
arXiv:2410.03355 - Faster image generation
Use speculative decoding to:
Generate multiple patches in parallel
Verify with a larger model
Speed up diffusion processes
Speech Generation
arXiv:2410.21951v1 - Real-time speech synthesis
Apply speculative decoding to:
Generate phoneme sequences
Verify with acoustic model
Achieve real-time synthesis
Future Directions
Hardware-Aware Speculation
Develop speculative kernels optimized for specific hardware:
// GPU-specific speculative kernel
__global__ void speculative_decode_kernel(
float* logits,
float* draft_logits,
int num_tokens
) {
// Parallel speculative decoding on GPU
}
Adaptive Speculation
Dynamic speculation length based on content difficulty:
def adaptive_speculation(context, draft_model):
# Estimate prediction difficulty
difficulty = estimate_difficulty(context)
if difficulty == "easy":
k = 20 # More speculation
elif difficulty == "medium":
k = 10 # Standard speculation
else:
k = 5 # Conservative speculation
return k
Single-Model Approaches
Eliminate the need for a separate draft model:
class SelfSpeculativeModel:
def __init__(self, model):
self.model = model
def speculate(self, context, k=10):
# Use model's own predictions as draft
draft = self.model.predict(context)
# Verify with model
verified = self.model.verify(draft)
return verified
Cross-Architecture Support
Optimize for CPU, GPU, NPU, and specialized accelerators:
| Architecture | Optimization |
|---|---|
| CPU | SIMD speculative kernels |
| GPU | CUDA speculative kernels |
| NPU | Neural processing units |
| TPU | Tensor processing units |
Practical Recommendations
For Developers
Start simple: Implement plain speculative decoding before stacking other optimizations
Profile first: Measure bottlenecks before optimization
Combine wisely: Not all techniques apply to all scenarios
Monitor quality: Speedup shouldn't degrade output quality
For Researchers
Focus on adaptive approaches: Dynamic speculation length
Investigate single-model methods: Eliminate draft model requirement
Explore hardware-specific optimizations: GPU-specific kernels
Benchmark across hardware: CPU, GPU, edge devices
For Production Systems
Use llama.cpp: Native speculative decoding support
Combine with quantization: For edge deployment
Monitor latency: Ensure SLAs are met
Scale horizontally: Multiple draft models for higher throughput
Code Examples
llama.cpp Implementation
llama.cpp has native speculative decoding support via its speculative example (flag names vary by version; check --help on your build):
# Basic usage
./llama-speculative -m models/llama-7b.gguf \
  -md models/draft-1b.gguf \
  --draft 10
# With a larger batch size
./llama-speculative -m models/llama-7b.gguf \
  -md models/draft-1b.gguf \
  --draft 10 \
  -b 8
HuggingFace Implementation
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load models ("draft-model"/"target-model" are placeholders; both must share a tokenizer)
draft = AutoModelForCausalLM.from_pretrained("draft-model")
target = AutoModelForCausalLM.from_pretrained("target-model")
tokenizer = AutoTokenizer.from_pretrained("target-model")
# Generate via assisted generation (Transformers' built-in speculative decoding)
inputs = tokenizer("Write a poem about", return_tensors="pt")
output = target.generate(**inputs, assistant_model=draft, max_new_tokens=100)
# With custom sampling parameters
output = target.generate(
    **inputs,
    assistant_model=draft,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
PyTorch Implementation
A from-scratch sketch of one draft-then-verify round (assumes HF-style models whose forward pass returns .logits; resampling on rejection is omitted for brevity):
import torch
from torch import nn
class SpeculativeDecoder(nn.Module):
    def __init__(self, draft_model, target_model, k=10):
        super().__init__()
        self.draft = draft_model
        self.target = target_model
        self.k = k
    @torch.no_grad()
    def forward(self, input_ids):
        # 1. Draft model proposes k tokens autoregressively (cheap)
        draft_ids, draft_p = input_ids, []
        for _ in range(self.k):
            probs = torch.softmax(self.draft(draft_ids).logits[:, -1], dim=-1)
            tok = torch.multinomial(probs, 1)
            draft_p.append(probs.gather(-1, tok))
            draft_ids = torch.cat([draft_ids, tok], dim=-1)
        # 2. Target scores all k candidates in ONE forward pass
        target_p = torch.softmax(self.target(draft_ids).logits, dim=-1)
        # 3. Accept each draft token with probability min(1, p_target/p_draft)
        n, out = input_ids.shape[1], input_ids
        for i in range(self.k):
            tok = draft_ids[:, n + i : n + i + 1]
            p = target_p[:, n + i - 1].gather(-1, tok)
            if torch.rand(()) < (p / draft_p[i]).clamp(max=1.0):
                out = torch.cat([out, tok], dim=-1)
            else:
                break  # first rejection: stop and (in a full impl) resample
        return out
Performance Benchmarks
Latency Comparison
| Model | Method | Tokens/sec | Latency (100 tokens) |
|---|---|---|---|
| 7B | Standard | 15 | 6.7s |
| 7B | Speculative | 32 | 3.1s |
| 13B | Standard | 12 | 8.3s |
| 13B | Speculative | 25 | 4.0s |
| 70B | Standard | 8 | 12.5s |
| 70B | Speculative | 18 | 5.6s |
Throughput Comparison
| System | Method | Requests/sec | P99 Latency |
|---|---|---|---|
| Single GPU | Standard | 12 | 2.1s |
| Single GPU | Speculative | 28 | 0.9s |
| Multi-GPU | Standard | 45 | 0.8s |
| Multi-GPU | Speculative | 92 | 0.4s |
Energy Efficiency
Speculative decoding reduces energy consumption:
40-60% less energy at the same throughput
2-3x less energy at the same latency
A correspondingly smaller carbon footprint
Common Pitfalls
1. Poor Draft Model Selection
Using a too-small or too-large draft model:
# ❌ Too small - poor predictions
draft = SmallModel("100M") # Too small
# ❌ Too large - defeats purpose
draft = SmallModel("10B") # No speedup
# ✅ Just right
draft = SmallModel("1B-3B") # Optimal range
2. Incorrect Speculation Length
Using too short or too long speculation:
# ❌ Too short - no speedup
decoder = SpeculativeDecoding(draft, target, k=1) # k=1 = standard
# ❌ Too long - wasted draft compute
decoder = SpeculativeDecoding(draft, target, k=50) # Acceptance drops off after a few tokens
# ✅ Optimal
decoder = SpeculativeDecoding(draft, target, k=10) # Default
3. Ignoring Memory Constraints
Draft models add memory overhead:
# ❌ Ignore memory
target = LargeModel("70B")
draft = SmallModel("21B") # +30% memory, slow drafting
# ✅ Consider memory
target = LargeModel("70B")
draft = SmallModel("13B") # ~+20% memory overhead
4. Not Monitoring Quality
Speedup shouldn't degrade quality:
# ✅ Monitor quality
def monitor_quality(output, reference=None):
metrics = {
"perplexity": calculate_perplexity(output),
"bleu": calculate_bleu(output, reference) if reference else None,
"human_eval": run_human_eval(output)
}
return metrics
Conclusion
Speculative decoding is a powerful technique for accelerating LLM inference:
Key Takeaways:
2-4x speedup without quality degradation
No retraining required
Works with any model
Industry adoption across Google Search, AI Overviews, and more
Implementation Priority:
Start with speculative decoding (largest impact)
Add KV cache quantization (easy integration)
Implement Flash Attention (last step)
Consider model distillation (higher effort)
Future Outlook:
Hardware-aware speculation
Adaptive speculation length
Single-model approaches
Cross-architecture optimization
The field is rapidly evolving, and speculative decoding is at the forefront of making LLMs more efficient and accessible. Whether you're a developer deploying models, a researcher optimizing inference, or a practitioner building AI applications, understanding and implementing speculative decoding is essential.
References
Primary Sources
Leviathan et al. (2023). Fast Inference from Transformers via Speculative Decoding. arXiv:2211.17192
Google Research. Looking back at speculative decoding. December 2024. https://research.google/blog/looking-back-at-speculative-decoding/
Research Developments
Cai et al. (2024). Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. arXiv:2401.10774
Chen et al. (2023). Accelerating Large Language Model Decoding with Speculative Sampling. arXiv:2302.01318
Several draft guesses. arXiv:2310.15141
Model distillation. arXiv:2310.08461
Shared model parts. arXiv:2402.08644
Single model approach. arXiv:2401.10774
Verify all draft tokens. arXiv:2403.10444
Related Topics
kv-cache-quantization - Memory optimization
flash-attention - Compute optimization
inference-optimization - General optimization techniques
llama.cpp - Reference implementation