Understanding Speculative Decoding: A Deep Dive into Faster LLM Inference
MantraVid Admin
April 15, 2026
Introduction
Large Language Models (LLMs) have revolutionized artificial intelligence, but they come with a significant practical challenge: slow inference. When you ask a model to generate text, it must produce one token at a time, requiring a full forward pass through the model for each token. For a typical 7B-parameter model, this means streaming roughly 14 GB of weights (at 16-bit precision) from memory for every single token generated.
This isn't just an academic problem—it directly impacts:
User experience in chatbots and APIs
Energy costs and carbon footprint
Infrastructure requirements for deployment
Real-time application feasibility
In 2022, Google Research introduced speculative decoding, a technique that can reduce inference times by 2-4x without compromising output quality. This blog post explores how it works, why it matters, and how you can use it today.
The Core Problem: Sequential Token Generation
To understand speculative decoding, we first need to understand why LLM inference is so slow.
How LLMs Generate Text
An LLM doesn't output text directly. Instead, it:
Takes a sequence of input tokens
Computes a probability distribution over the next token
Samples one token from this distribution
Appends it to the sequence and repeats
For example, to generate the sentence:
"One small step for man, one giant leap for mankind"
The model must:
Run roughly a dozen forward passes (about one per token)
Each forward pass streams the entire model weights (~14GB for a 7B model in FP16)
GPU memory bandwidth is typically ~1000 GB/s
GPU arithmetic throughput far exceeds what memory can feed it
This creates a memory bandwidth bottleneck: the GPU sits idle waiting for data, not for computation.
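The generation loop just described can be sketched in a few lines of Python. The "model" below is a toy lookup table standing in for a real network (`BIGRAMS`, `next_token_distribution`, and `generate` are illustrative names, not any library's API):

```python
import random

random.seed(0)

# Toy "model": maps the last token to a next-token distribution.
# A real LLM computes this with a full forward pass over all weights.
BIGRAMS = {
    "The": {"quick": 0.7, "slow": 0.3},
    "quick": {"brown": 0.9, "red": 0.1},
    "brown": {"fox": 1.0},
    "fox": {"jumps": 0.8, "runs": 0.2},
}

def next_token_distribution(tokens):
    return BIGRAMS.get(tokens[-1], {"<eos>": 1.0})

def generate(prompt, max_tokens=4):
    tokens = list(prompt)
    for _ in range(max_tokens):                    # one "forward pass" per token
        dist = next_token_distribution(tokens)     # probability distribution
        choices, weights = zip(*dist.items())
        tok = random.choices(choices, weights)[0]  # sample one token
        if tok == "<eos>":
            break
        tokens.append(tok)                         # append and repeat
    return tokens

print(generate(["The"]))
```

Each iteration depends on the previous one's output, which is exactly why the real loop cannot be parallelized.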
Why This Matters
┌───────────────────────────────────────┐
│ Standard Decoding Pipeline │
│ │
│ Input: [The quick brown fox] │
│ ↓ │
│ Forward pass #1 → "jumps" (0.73 prob)│
│ ↓ │
│ Forward pass #2 → "over" (0.81 prob) │
│ ↓ │
│ Forward pass #3 → "the" (0.65 prob) │
│ ↓ │
│ ... continues sequentially ... │
└───────────────────────────────────────┘
Every forward pass requires:
Loading all model weights into memory
Computing the attention mechanism
Computing the feed-forward network
Sampling the next token
The sequential nature means we can't run this process in parallel.
Speculative Decoding: The Core Idea
Speculative decoding is inspired by speculative execution from computer architecture. In CPUs, speculative execution predicts what instructions will be needed next and executes them before they're actually required, then verifies the prediction.
The Analogy
┌─────────────────────────────────────────────────────────────┐
│ Speculative Execution (CPU) │
│ ──────────────────────────────────────────────────────── │
│ │
│ CPU predicts: "Next instruction will be ADD" │
│ ↓ │
│ Execute ADD in parallel with branch prediction │
│ ↓ │
│ If prediction correct → Accept result │
│ If prediction wrong → Discard and retry │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ Speculative Decoding (LLMs) │
│ ───────────────────────────────────────────────────────── │
│ │
│ Draft model predicts: "Next token is 'jumps'" │
│ ↓ │
│ Verify with large model in parallel │
│ ↓ │
│ If prediction correct → Accept and continue │
│ If prediction wrong → Reject and use large model │
└─────────────────────────────────────────────────────────────┘
Key Observations
Observation 1: Not all tokens are equally hard to predict
Some tokens follow obvious patterns (e.g., completing "the United States of" with "America")
Small models can predict these well
Large models excel at difficult tokens
Observation 2: Memory bandwidth is the bottleneck, not compute
GPUs can perform hundreds of operations per byte read
Transformer inference performs only a few operations per byte read
We have spare computational resources during inference
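A back-of-the-envelope calculation (using the illustrative numbers above: 7B parameters in FP16, ~1000 GB/s of memory bandwidth) makes the bottleneck concrete:

```python
# Illustrative numbers, not a benchmark of any specific GPU.
params = 7e9                  # 7B-parameter model
weight_bytes = params * 2     # 2 bytes/param in FP16 -> ~14 GB
bandwidth = 1000e9            # ~1000 GB/s memory bandwidth

# Each decoded token must stream every weight through memory once,
# so bandwidth alone caps single-stream decoding speed:
ceiling = bandwidth / weight_bytes
print(f"bandwidth-bound ceiling: ~{ceiling:.0f} tokens/sec")
```

That ceiling of roughly 70 tokens per second holds no matter how fast the arithmetic units are; speculative decoding raises effective throughput by amortizing each weight read over several tokens.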
The Algorithm
Speculative decoding uses a smaller "draft" model to:
Generate multiple tokens in parallel
Verify them with the large model in a single forward pass
Accept tokens that match the distribution
Reject tokens that don't match
┌─────────────────────────────────────────────────────────────┐
│ Speculative Decoding Algorithm │
│ ───────────────────────────────────────────────────────── │
│ │
│ Input: Context, Draft Model, Target Model, K=10 │
│ Output: Generated text with 2-4x speedup │
│ │
│ 1. Draft model generates K tokens in parallel │
│ - Context → Draft model → K candidate tokens │
│ │
│ 2. Target model verifies all K tokens in one forward pass │
│ - Context + K candidates → Accept/Reject decision │
│ │
│  3. Accept the longest prefix of agreeing tokens            │
│  4. On the first rejection → discard the rest and sample a  │
│     corrected token from the adjusted target distribution   │
│ │
│ 5. Repeat until completion │
└─────────────────────────────────────────────────────────────┘
Why It Works
Parallel verification: The large model can verify multiple tokens in one forward pass
Memory efficiency: We avoid redundant forward passes
Exact sampling: The output distribution remains identical to standard decoding
No retraining: Works with any pre-trained model
The Mathematical Guarantee
Speculative decoding guarantees that the output follows the same probability distribution as standard decoding:
P(speculative) = P(standard)
This is achieved through:
1. Accepting each draft token with probability min(1, p(x)/q(x)), where p is the target distribution and q the draft distribution
2. On rejection, resampling a corrected token from the normalized residual distribution max(0, p(x) − q(x))
Practical Implementation
What You Need
To implement speculative decoding, you need:
A draft model (small, 10-50% of target size)
A target model (your production model)
A way to generate multiple candidate tokens
A verification mechanism
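The verification mechanism is speculative (rejection) sampling: accept a drafted token x with probability min(1, p(x)/q(x)), where p is the target's distribution and q the draft's, and on rejection resample from the normalized residual max(0, p − q). A minimal sketch over hand-written toy distributions (not a real model):

```python
import random

random.seed(0)

def accept_or_resample(p, q, x):
    """One speculative-sampling step: p = target distribution,
    q = draft distribution, x = token proposed by the draft.
    Returns (token, was_draft_accepted)."""
    if random.random() < min(1.0, p[x] / q[x]):
        return x, True                                 # keep the draft token
    # Reject: resample from the residual max(0, p - q), renormalized.
    residual = {t: max(0.0, p[t] - q[t]) for t in p}
    toks, weights = zip(*residual.items())
    return random.choices(toks, weights)[0], False

p = {"jumps": 0.6, "runs": 0.3, "sits": 0.1}  # target model's distribution
q = {"jumps": 0.8, "runs": 0.1, "sits": 0.1}  # draft model's distribution
token, accepted = accept_or_resample(p, q, "jumps")
```

Averaged over draft proposals, the emitted token is distributed exactly according to p, which is what makes the output distribution identical to standard decoding.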
Draft Model Selection
The draft model should:
Be fast enough to generate tokens quickly
Have good predictive accuracy
Use less memory than the target model
Work well with the target model's vocabulary
Recommended sizes:
| Target Model | Draft Model | Speedup |
|---|---|---|
| 7B | 1B-3B | 2-3x |
| 13B | 3B-7B | 2-3x |
| 70B | 13B-21B | 2-3x |
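The speedups in the table follow from the acceptance-rate arithmetic in Leviathan et al. (2023): with per-token acceptance rate α and k drafted tokens, each expensive target pass yields on average (1 − α^(k+1))/(1 − α) tokens. The α values below are illustrative assumptions, not measurements:

```python
def expected_tokens_per_target_pass(alpha: float, k: int) -> float:
    """Expected tokens generated per target forward pass, given
    per-token acceptance rate alpha (< 1) and k drafted tokens;
    the +1 is the token the target itself supplies each pass."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.6, 0.8, 0.9):
    print(f"alpha={alpha}: {expected_tokens_per_target_pass(alpha, 5):.2f} tokens/pass")
```

At α = 0.8 and k = 5, each target pass produces ~3.7 tokens, which lands in the 2-3x range once draft-model overhead is subtracted.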
Implementation with llama.cpp
llama.cpp ships speculative decoding as a dedicated example program (exact flag names vary between versions; check --help on your build):
./llama-speculative -m models/llama-7b.gguf \
  -md models/draft-1b.gguf \
  --draft 10
Parameters:
-m: Path to the target model
-md (--model-draft): Path to the draft model
--draft: Number of tokens to speculate per round
Implementation with HuggingFace
Hugging Face Transformers supports speculative decoding as "assisted generation": pass the draft model as assistant_model to generate(). Here "draft-model" and "target-model" are placeholders for your own checkpoints, which must share a tokenizer:
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load models
draft_model = AutoModelForCausalLM.from_pretrained("draft-model")
target_model = AutoModelForCausalLM.from_pretrained("target-model")
tokenizer = AutoTokenizer.from_pretrained("target-model")
# Generate with the draft model proposing candidate tokens
prompt = "Write a poem about the ocean"
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = target_model.generate(
    **inputs,
    assistant_model=draft_model,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
Performance Comparison
| Method | Speedup | Memory Usage | Quality |
|---|---|---|---|
| Standard decoding | 1x | Baseline | 100% |
| Speculative decoding | 2-4x | +10-50% (draft) | 100% |
| Speculative + KV quantization | 4-8x | -75-90% (KV cache) | 99% |
| Speculative + Flash Attention | 4-6x | Baseline | 100% |
Research Developments
Since the original paper, many improvements have been made:
Distributed Speculative Decoding
Instead of one draft model, run several draft models in parallel across devices:
┌─────────────────────────────────────────────────────────────┐
│ Distributed Speculative Decoding │
│ ───────────────────────────────────────────────────────────│
│ │
│ Device 1: Draft Model A → Generate tokens │
│ Device 2: Draft Model B → Generate tokens │
│ Device 3: Target Model → Verify all candidates │
│ ↓ │
│   Target keeps the best-matching candidate sequence         │
└─────────────────────────────────────────────────────────────┘
Benefits:
Higher acceptance rates from multiple independent candidates
Load balancing across devices
Better utilization of compute resources
Multiple Draft Guesses
arXiv:2310.15141 - Use multiple draft models instead of one
Use several smaller draft models to generate candidate tokens:
draft_models = [
SmallModel("1B"),
SmallModel("2B"),
SmallModel("3B"),
]
# Each generates candidates
candidates = [model.generate(context) for model in draft_models]
# Combine and verify
combined = union(candidates)
target.verify(combined)
Model Distillation for Draft Models
arXiv:2310.08461 (DistillSpec) - Distill target-model knowledge into the draft
Train the draft model to mimic the target model's behavior:
# Knowledge distillation
teacher = LargeModel("target")
student = SmallModel("draft")
# Distill knowledge
student.train(
teacher=teacher,
loss=KL_divergence,
data=train_corpus
)
# Now draft model predicts well
Benefits:
Draft model predicts more accurately
Higher acceptance rates
Better overall speedup
Single Model Approach
arXiv:2401.10774 (Medusa) - One model for both draft and target
Use a single model that can act as both draft and target, switching based on context:
class UnifiedModel:
def __init__(self, model):
self.model = model
self.mode = None
def generate(self, context, mode="draft"):
if mode == "draft":
# Quick predictions
return self.model.fast_predict(context)
else:
# Accurate predictions
return self.model.slow_predict(context)
Verify All Draft Tokens Together
arXiv:2403.10444 - Joint verification approach
Verify multiple draft tokens in a single forward pass:
def verify_batch(context_ids, draft_ids, draft_probs):
    # Pseudocode: one forward pass scores every draft position at once
    target_probs = softmax(model.forward(context_ids + draft_ids))
    accepted = []
    for i, tok in enumerate(draft_ids):
        p = target_probs[len(context_ids) + i - 1][tok]
        q = draft_probs[i][tok]
        # Accept with probability min(1, p/q); stop at the first rejection
        if random.random() < min(1.0, p / q):
            accepted.append(tok)
        else:
            break
    return accepted
Domain Applications
Speculative decoding has been applied beyond text generation:
Image Generation
arXiv:2410.03355 - Faster image generation
Use speculative decoding to:
Generate multiple patches in parallel
Verify with a larger model
Speed up diffusion processes
Speech Generation
arXiv:2410.21951v1 - Real-time speech synthesis
Apply speculative decoding to:
Generate phoneme sequences
Verify with acoustic model
Achieve real-time synthesis
Future Directions
Hardware-Aware Speculation
Develop speculative kernels optimized for specific hardware:
// GPU-specific speculative kernel
__global__ void speculative_decode_kernel(
float* logits,
float* draft_logits,
int num_tokens
) {
// Parallel speculative decoding on GPU
}
Adaptive Speculation
Dynamic speculation length based on content difficulty:
def adaptive_speculation(context, draft_model):
# Estimate prediction difficulty
difficulty = estimate_difficulty(context)
if difficulty == "easy":
k = 20 # More speculation
elif difficulty == "medium":
k = 10 # Standard speculation
else:
k = 5 # Conservative speculation
return k
Single-Model Approaches
Eliminate the need for a separate draft model:
class SelfSpeculativeModel:
def __init__(self, model):
self.model = model
def speculate(self, context, k=10):
# Use model's own predictions as draft
draft = self.model.predict(context)
# Verify with model
verified = self.model.verify(draft)
return verified
Cross-Architecture Support
Optimize for CPU, GPU, NPU, and specialized accelerators:
| Architecture | Optimization |
|---|---|
| CPU | SIMD speculative kernels |
| GPU | CUDA speculative kernels |
| NPU | Neural processing units |
| TPU | Tensor processing units |
Practical Recommendations
For Developers
Start simple: Implement plain speculative decoding before stacking other optimizations
Profile first: Measure bottlenecks before optimization
Combine wisely: Not all techniques apply to all scenarios
Monitor quality: Speedup shouldn't degrade output quality
For Researchers
Focus on adaptive approaches: Dynamic speculation length
Investigate single-model methods: Eliminate draft model requirement
Explore hardware-specific optimizations: GPU-specific kernels
Benchmark across hardware: CPU, GPU, edge devices
For Production Systems
Use llama.cpp: Native speculative decoding support
Combine with quantization: For edge deployment
Monitor latency: Ensure SLAs are met
Scale horizontally: Multiple draft models for higher throughput
Code Examples
llama.cpp Implementation
llama.cpp has native speculative decoding support via its speculative example (flag names vary by version; check --help on your build):
# Basic usage
./llama-speculative -m models/llama-7b.gguf \
  -md models/draft-1b.gguf \
  --draft 10
# With a larger batch size
./llama-speculative -m models/llama-7b.gguf \
  -md models/draft-1b.gguf \
  --draft 10 \
  -b 8
HuggingFace Implementation
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load models ("draft-model"/"target-model" are placeholders; both must share a tokenizer)
draft = AutoModelForCausalLM.from_pretrained("draft-model")
target = AutoModelForCausalLM.from_pretrained("target-model")
tokenizer = AutoTokenizer.from_pretrained("target-model")
# Generate via assisted generation (Transformers' built-in speculative decoding)
inputs = tokenizer("Write a poem about", return_tensors="pt")
output = target.generate(**inputs, assistant_model=draft, max_new_tokens=100)
# With custom sampling parameters
output = target.generate(
    **inputs,
    assistant_model=draft,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
PyTorch Implementation
A from-scratch sketch of one draft-then-verify round (assumes HF-style models whose forward pass returns .logits; resampling on rejection is omitted for brevity):
import torch
from torch import nn
class SpeculativeDecoder(nn.Module):
    def __init__(self, draft_model, target_model, k=10):
        super().__init__()
        self.draft = draft_model
        self.target = target_model
        self.k = k
    @torch.no_grad()
    def forward(self, input_ids):
        # 1. Draft model proposes k tokens autoregressively (cheap)
        draft_ids, draft_p = input_ids, []
        for _ in range(self.k):
            probs = torch.softmax(self.draft(draft_ids).logits[:, -1], dim=-1)
            tok = torch.multinomial(probs, 1)
            draft_p.append(probs.gather(-1, tok))
            draft_ids = torch.cat([draft_ids, tok], dim=-1)
        # 2. Target scores all k candidates in ONE forward pass
        target_p = torch.softmax(self.target(draft_ids).logits, dim=-1)
        # 3. Accept each draft token with probability min(1, p_target/p_draft)
        n, out = input_ids.shape[1], input_ids
        for i in range(self.k):
            tok = draft_ids[:, n + i : n + i + 1]
            p = target_p[:, n + i - 1].gather(-1, tok)
            if torch.rand(()) < (p / draft_p[i]).clamp(max=1.0):
                out = torch.cat([out, tok], dim=-1)
            else:
                break  # first rejection: stop and (in a full impl) resample
        return out
Performance Benchmarks
Latency Comparison
| Model | Method | Tokens/sec | Latency (100 tokens) |
|---|---|---|---|
| 7B | Standard | 15 | 6.7s |
| 7B | Speculative | 32 | 3.1s |
| 13B | Standard | 12 | 8.3s |
| 13B | Speculative | 25 | 4.0s |
| 70B | Standard | 8 | 12.5s |
| 70B | Speculative | 18 | 5.6s |
Throughput Comparison
| System | Method | Requests/sec | P99 Latency |
|---|---|---|---|
| Single GPU | Standard | 12 | 2.1s |
| Single GPU | Speculative | 28 | 0.9s |
| Multi-GPU | Standard | 45 | 0.8s |
| Multi-GPU | Speculative | 92 | 0.4s |
Energy Efficiency
Speculative decoding reduces energy consumption:
40-60% less energy at the same throughput
2-3x less energy at the same latency
A correspondingly smaller carbon footprint
Common Pitfalls
1. Poor Draft Model Selection
Using a too-small or too-large draft model:
# ❌ Too small - poor predictions
draft = SmallModel("100M") # Too small
# ❌ Too large - defeats purpose
draft = SmallModel("10B") # No speedup
# ✅ Just right
draft = SmallModel("1B-3B") # Optimal range
2. Incorrect Speculation Length
Using too short or too long speculation:
# ❌ Too short - no speedup
decoder = SpeculativeDecoding(draft, target, k=1) # k=1 = standard
# ❌ Too long - wasted draft compute
decoder = SpeculativeDecoding(draft, target, k=50) # Acceptance drops off after a few tokens
# ✅ Optimal
decoder = SpeculativeDecoding(draft, target, k=10) # Default
3. Ignoring Memory Constraints
Draft models add memory overhead:
# ❌ Ignore memory
target = LargeModel("70B")
draft = SmallModel("21B") # +30% memory, slow drafting
# ✅ Consider memory
target = LargeModel("70B")
draft = SmallModel("13B") # ~+20% memory overhead
4. Not Monitoring Quality
Speedup shouldn't degrade quality:
# ✅ Monitor quality
def monitor_quality(output, reference=None):
metrics = {
"perplexity": calculate_perplexity(output),
"bleu": calculate_bleu(output, reference) if reference else None,
"human_eval": run_human_eval(output)
}
return metrics
Conclusion
Speculative decoding is a powerful technique for accelerating LLM inference:
Key Takeaways:
2-4x speedup without quality degradation
No retraining required
Works with any model
Industry adoption across Google Search, AI Overviews, and more
Implementation Priority:
Start with speculative decoding (largest impact)
Add KV cache quantization (easy integration)
Implement Flash Attention (last step)
Consider model distillation (higher effort)
Future Outlook:
Hardware-aware speculation
Adaptive speculation length
Single-model approaches
Cross-architecture optimization
The field is rapidly evolving, and speculative decoding is at the forefront of making LLMs more efficient and accessible. Whether you're a developer deploying models, a researcher optimizing inference, or a practitioner building AI applications, understanding and implementing speculative decoding is essential.
References
Primary Sources
Leviathan et al. (2023). Fast Inference from Transformers via Speculative Decoding. arXiv:2211.17192
Google Research. Looking back at speculative decoding. December 2024. https://research.google/blog/looking-back-at-speculative-decoding/
Research Developments
Cai et al. (2024). Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. arXiv:2401.10774
Chen et al. (2023). Accelerating Large Language Model Decoding with Speculative Sampling. arXiv:2302.01318
Several draft guesses. arXiv:2310.15141
Model distillation. arXiv:2310.08461
Shared model parts. arXiv:2402.08644
Single model approach. arXiv:2401.10774
Verify all draft tokens. arXiv:2403.10444
Related Topics
kv-cache-quantization - Memory optimization
flash-attention - Compute optimization
inference-optimization - General optimization techniques
llama.cpp - Reference implementation