Understanding Speculative Decoding: A Deep Dive into Faster LLM Inference


MantraVid Admin

April 15, 2026



Introduction

Large Language Models (LLMs) have revolutionized artificial intelligence, but they come with a significant practical challenge: slow inference. When you ask a model to generate text, it must produce one token at a time, requiring a full forward pass through the model for each token. For a typical 7B parameter model, this means streaming all ~14 GB of weights (in FP16) from memory for every single token generated.

This isn't just an academic problem—it directly impacts:

  • User experience in chatbots and APIs

  • Energy costs and carbon footprint

  • Infrastructure requirements for deployment

  • Real-time application feasibility

In 2022, Google Research introduced speculative decoding, a technique that can reduce inference times by 2-4x without compromising output quality. This blog post explores how it works, why it matters, and how you can use it today.


The Core Problem: Sequential Token Generation

To understand speculative decoding, we first need to understand why LLM inference is so slow.

How LLMs Generate Text

An LLM doesn't output text directly. Instead, it:

  1. Takes a sequence of input tokens

  2. Computes a probability distribution over the next token

  3. Samples one token from this distribution

  4. Appends it to the sequence and repeats
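The four steps above can be written as a tiny loop. A minimal sketch (`model` here is a toy stand-in that returns a next-token distribution; a real LLM forward pass would take its place):

```python
import random

def generate(model, tokens, n_new, rng=random.Random(0)):
    """Standard autoregressive decoding: one full forward pass per new token."""
    for _ in range(n_new):
        probs = model(tokens)  # forward pass -> distribution over the vocabulary
        next_token = rng.choices(range(len(probs)), weights=probs)[0]  # sample one token
        tokens = tokens + [next_token]  # append and repeat
    return tokens

# Toy "model" over a 2-token vocabulary: always predicts token 0
toy = lambda toks: [1.0, 0.0]
print(generate(toy, [1], 3))  # -> [1, 0, 0, 0]
```

Each iteration depends on the previous one's output, which is exactly why the passes cannot be batched.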

For example, to generate the sentence:

"One small step for man, one giant leap for mankind"

The model must:

  • Run roughly 12 forward passes (about one per token, depending on the tokenizer)

  • Each forward pass reads the entire model weights (~14 GB in FP16 for a 7B model)

  • Memory bandwidth is typically ~1000 GB/s on modern GPUs

  • Compute throughput is ~100x higher than what that bandwidth can feed during decoding

This creates a memory bandwidth bottleneck: the GPU sits idle waiting for data, not for computation.
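The back-of-envelope arithmetic behind that claim (assuming FP16 weights, so 2 bytes per parameter, and ~1000 GB/s of bandwidth):

```python
weights_gb = 7e9 * 2 / 1e9         # 7B parameters x 2 bytes (FP16) = 14 GB
bandwidth_gb_s = 1000              # typical HBM bandwidth on a modern GPU
max_tokens_per_s = bandwidth_gb_s / weights_gb
print(round(max_tokens_per_s, 1))  # -> 71.4 tokens/s, regardless of compute speed
```

No matter how fast the arithmetic units are, a batch-size-1 decoder that must re-read every weight per token cannot exceed this ceiling.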

Why This Matters

┌───────────────────────────────────────┐
│  Standard Decoding Pipeline           │
│                                       │
│  Input: [The quick brown fox]         │
│  ↓                                    │
│  Forward pass #1 → "jumps" (0.73 prob)│
│  ↓                                    │
│  Forward pass #2 → "over" (0.81 prob) │
│  ↓                                    │
│  Forward pass #3 → "the" (0.65 prob)  │
│  ↓                                    │
│  ... continues sequentially ...       │
└───────────────────────────────────────┘

Every forward pass requires:

  • Loading all model weights into memory

  • Computing the attention mechanism

  • Computing the feed-forward network

  • Sampling the next token

The sequential nature means we can't run this process in parallel.


Speculative Decoding: The Core Idea

Speculative decoding is inspired by speculative execution from computer architecture. In CPUs, speculative execution predicts what instructions will be needed next and executes them before they're actually required, then verifies the prediction.

The Analogy

┌─────────────────────────────────────────────────────────────┐
│  Speculative Execution (CPU)                                │
│  ────────────────────────────────────────────────────────   │
│                                                             │
│  CPU predicts: "Next instruction will be ADD"               │
│  ↓                                                          │
│  Execute ADD in parallel with branch prediction             │
│  ↓                                                          │
│  If prediction correct → Accept result                      │
│  If prediction wrong → Discard and retry                    │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│  Speculative Decoding (LLMs)                                │
│  ─────────────────────────────────────────────────────────  │
│                                                             │
│  Draft model predicts: "Next token is 'jumps'"              │
│  ↓                                                          │
│  Verify with large model in parallel                        │
│  ↓                                                          │
│  If prediction correct → Accept and continue                │
│  If prediction wrong → Reject and use large model           │
└─────────────────────────────────────────────────────────────┘

Key Observations

Observation 1: Not all tokens are equally hard to predict

  • Some tokens are easy, near-deterministic continuations (e.g., finishing a common phrase or copying a name from the prompt)

  • Small models can predict these well

  • Large models excel at difficult tokens

Observation 2: Memory bandwidth is the bottleneck, not compute

  • GPUs can perform hundreds of operations per byte read

  • Transformer inference performs only a few operations per byte read

  • We have spare computational resources during inference

The Algorithm

Speculative decoding uses a smaller "draft" model to:

  1. Generate multiple tokens in parallel

  2. Verify them with the large model in a single forward pass

  3. Accept tokens that match the distribution

  4. Reject tokens that don't match

┌─────────────────────────────────────────────────────────────┐
│  Speculative Decoding Algorithm                             │
│  ─────────────────────────────────────────────────────────  │
│                                                             │
│  Input: Context, Draft Model, Target Model, K=10            │
│  Output: Generated text with 2-4x speedup                   │
│                                                             │
│  1. Draft model generates K tokens in parallel              │
│     - Context → Draft model → K candidate tokens            │
│                                                             │
│  2. Target model verifies all K tokens in one forward pass  │
│     - Context + K candidates → Accept/Reject decision       │
│                                                             │
│  3. If all K accepted → Continue with next K tokens         │
│  4. On rejection → Resample that token from the target      │
│                                                             │
│  5. Repeat until completion                                 │
└─────────────────────────────────────────────────────────────┘
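Stitched into a generation loop, the algorithm looks like this (pure-Python sketch; `draft_propose` and `target_verify` are hypothetical stand-ins for the two model calls):

```python
def speculative_generate(context, draft_propose, target_verify, k, n_new):
    """Generate n_new tokens by repeated draft-then-verify rounds."""
    out = list(context)
    while len(out) < len(context) + n_new:
        candidates = draft_propose(out, k)                     # 1. draft k tokens
        accepted, correction = target_verify(out, candidates)  # 2. one target pass
        out += accepted                                        # 3. keep accepted prefix
        if correction is not None:                             # 4. rejected position is
            out.append(correction)                             #    replaced by a target sample
    return out[: len(context) + n_new]

# Toy stand-ins: draft proposes [0]*k; target accepts the first two, corrects with 9
propose = lambda ctx, k: [0] * k
verify = lambda ctx, cand: (cand[:2], 9)
print(speculative_generate([1, 2], propose, verify, k=4, n_new=6))  # -> [1, 2, 0, 0, 9, 0, 0, 9]
```

Note that every round makes progress: even if all drafts are rejected, the correction step still emits one target-sampled token.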

Why It Works

  1. Parallel verification: The large model can verify multiple tokens in one forward pass

  2. Memory efficiency: We avoid redundant forward passes

  3. Exact sampling: The output distribution remains identical to standard decoding

  4. No retraining: Works with any pre-trained model

The Mathematical Guarantee

Speculative decoding guarantees that the output follows the same probability distribution as standard decoding:

P(speculative) = P(standard)

This is achieved through rejection sampling:
1. Each draft token is accepted with a probability that depends on how closely the draft distribution matches the target's
2. Rejected positions are resampled from a corrected (residual) target distribution
3. Together, these steps make the sampled output distribution exactly the target model's
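Concretely, a draft token x is accepted with probability min(1, p(x)/q(x)), where p is the target distribution and q the draft distribution; on rejection, a replacement is sampled from the normalized residual max(0, p − q). A toy simulation over a hypothetical 3-token vocabulary shows the output matches p exactly:

```python
import random

def speculative_sample(p, q, rng):
    """Sample one token via speculative sampling: draft dist q, target dist p."""
    x = rng.choices(range(len(q)), weights=q)[0]  # draft proposes x ~ q
    if rng.random() < min(1.0, p[x] / q[x]):      # accept with prob min(1, p/q)
        return x
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(range(len(p)), weights=residual)[0]  # resample from residual

rng = random.Random(0)
p = [0.6, 0.3, 0.1]  # target distribution
q = [0.3, 0.3, 0.4]  # draft distribution
n = 200_000
counts = [0, 0, 0]
for _ in range(n):
    counts[speculative_sample(p, q, rng)] += 1
print([round(c / n, 2) for c in counts])  # ≈ [0.6, 0.3, 0.1]
```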

Practical Implementation

What You Need

To implement speculative decoding, you need:

  1. A draft model (small, 10-50% of target size)

  2. A target model (your production model)

  3. A way to generate multiple candidate tokens

  4. A verification mechanism

Draft Model Selection

The draft model should:

  • Be fast enough to generate tokens quickly

  • Have good predictive accuracy

  • Use less memory than the target model

  • Work well with the target model's vocabulary

Recommended sizes:

| Target Model | Draft Model | Speedup |
|--------------|-------------|---------|
| 7B           | 1B-3B       | 2-3x    |
| 13B          | 3B-7B       | 2-3x    |
| 70B          | 13B-21B     | 2-3x    |
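These sizes can be sanity-checked with the expected-improvement formula from Leviathan et al. (2023): with per-token acceptance rate α, speculation length γ, and draft-to-target cost ratio c, the expected speedup is (1 − α^(γ+1)) / ((1 − α)(γc + 1)). A quick calculator (the example numbers are illustrative assumptions):

```python
def expected_speedup(alpha, gamma, c):
    """Expected walltime improvement (Leviathan et al., 2023).
    alpha: prob. each draft token is accepted
    gamma: tokens drafted per round
    c:     cost of one draft pass relative to one target pass"""
    return (1 - alpha ** (gamma + 1)) / ((1 - alpha) * (gamma * c + 1))

# e.g. 80% acceptance, 5 drafted tokens, draft 10x cheaper than the target
print(round(expected_speedup(0.8, 5, 0.1), 2))  # -> 2.46
```

A draft that is too large raises α slightly but inflates c, which is why mid-sized drafts tend to win.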

Implementation with llama.cpp

llama.cpp ships a dedicated speculative-decoding example (flag names vary between versions; check --help for your build):

./llama-speculative -m models/llama-7b.gguf \
    -md models/draft-1b.gguf \
    --draft 10

Parameters:

  • -m: Path to the target model

  • -md (--model-draft): Path to the draft model

  • --draft: Number of tokens to speculate per round (e.g., 10)

Implementation with HuggingFace

HuggingFace Transformers exposes speculative decoding as "assisted generation": pass the draft model via the assistant_model argument of generate() (model names below are placeholders):

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load models
draft_model = AutoModelForCausalLM.from_pretrained("draft-model")
target_model = AutoModelForCausalLM.from_pretrained("target-model")
tokenizer = AutoTokenizer.from_pretrained("target-model")

# Generate with assisted (speculative) decoding
prompt = "Write a poem about the ocean"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = target_model.generate(
    **inputs,
    assistant_model=draft_model,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Performance Comparison

| Method                        | Speedup | Memory Usage    | Quality |
|-------------------------------|---------|-----------------|---------|
| Standard decoding             | 1x      | Baseline        | 100%    |
| Speculative decoding          | 2-4x    | +10-50% (draft) | 100%    |
| Speculative + KV quantization | 4-8x    | -75-90%         | 99%     |
| Speculative + Flash Attention | 4-6x    | Baseline        | 100%    |


Research Developments

Since the original paper, many improvements have been made:

Distributed Speculative Decoding

arXiv:2302.01318 - Multiple draft guesses across devices

Instead of one draft model, use multiple draft models in parallel:

┌─────────────────────────────────────────────────────────────┐
│  Distributed Speculative Decoding                           │
│  ───────────────────────────────────────────────────────────│
│                                                             │
│  Device 1: Draft Model A → Generate tokens                  │
│  Device 2: Draft Model B → Generate tokens                  │
│  Device 3: Target Model → Verify all candidates             │
│  ↓                                                          │
│  Higher acceptance: target can accept any candidate set     │
└─────────────────────────────────────────────────────────────┘

Benefits:

  • Higher acceptance rates from multiple parallel candidates

  • Load balancing across devices

  • Better utilization of compute resources

Multiple Draft Guesses

arXiv:2310.15141 - Use multiple draft models instead of one

Use several smaller draft models to generate candidate tokens:

# Illustrative pseudocode: SmallModel, union, and target.verify are placeholders
draft_models = [
    SmallModel("1B"),
    SmallModel("2B"),
    SmallModel("3B"),
]

# Each draft model proposes candidate continuations
candidates = [model.generate(context) for model in draft_models]

# Pool the candidates and verify them with the target in one pass
combined = union(candidates)
target.verify(combined)

Model Distillation for Draft Models

arXiv:2310.08461 - Distill target model knowledge into draft

Train the draft model to mimic the target model's behavior:

# Knowledge distillation
teacher = LargeModel("target")
student = SmallModel("draft")

# Distill knowledge
student.train(
    teacher=teacher,
    loss=KL_divergence,
    data=train_corpus
)

# Now draft model predicts well

Benefits:

  • Draft model predicts more accurately

  • Higher acceptance rates

  • Better overall speedup

Single Model Approach

arXiv:2401.10774 - One model for both draft and target

Use a single model that can act as both draft and target, switching based on context:

class UnifiedModel:
    def __init__(self, model):
        self.model = model

    def generate(self, context, mode="draft"):
        if mode == "draft":
            # Cheap, fast predictions (e.g., early-exit layers)
            return self.model.fast_predict(context)
        else:
            # Full-depth, accurate predictions
            return self.model.slow_predict(context)

Verify All Draft Tokens Together

arXiv:2403.10444 - Joint verification approach

Verify multiple draft tokens in a single forward pass:

def verify_batch(model, context, candidates):
    # One forward pass scores the context plus every draft token
    logits = model.forward(context + candidates)
    probs = softmax(logits)  # per-position next-token distributions

    # Accept drafts left to right while they pass the acceptance test;
    # stop at the first rejection (illustrative pseudocode)
    accepted = []
    for i, candidate in enumerate(candidates):
        if accept(candidate, probs[i]):  # e.g., the min(1, p_target/p_draft) test
            accepted.append(candidate)
        else:
            break

    return accepted

Domain Applications

Speculative decoding has been applied beyond text generation:

Image Generation

arXiv:2410.03355 - Faster image generation

Use speculative decoding to:

  • Generate multiple patches in parallel

  • Verify with a larger model

  • Speed up diffusion processes

Speech Generation

arXiv:2410.21951v1 - Real-time speech synthesis

Apply speculative decoding to:

  • Generate phoneme sequences

  • Verify with acoustic model

  • Achieve real-time synthesis


Future Directions

Hardware-Aware Speculation

Develop speculative kernels optimized for specific hardware:

// GPU-specific speculative kernel
__global__ void speculative_decode_kernel(
    float* logits,
    float* draft_logits,
    int num_tokens
) {
    // Parallel speculative decoding on GPU
}

Adaptive Speculation

Dynamic speculation length based on content difficulty:

def adaptive_speculation(context, draft_model):
    # Estimate prediction difficulty (placeholder: e.g., draft-model entropy)
    difficulty = estimate_difficulty(context, draft_model)

    if difficulty == "easy":
        k = 20  # Speculate aggressively
    elif difficulty == "medium":
        k = 10  # Standard speculation
    else:
        k = 5   # Conservative speculation

    return k

Single-Model Approaches

Eliminate the need for a separate draft model:

class SelfSpeculativeModel:
    def __init__(self, model):
        self.model = model

    def speculate(self, context, k=10):
        # Draft k tokens cheaply (e.g., via early-exit layers of the same model)
        draft = self.model.predict(context, n_tokens=k)

        # Verify the draft with the full model
        verified = self.model.verify(draft)

        return verified

Cross-Architecture Support

Optimize for CPU, GPU, NPU, and specialized accelerators:

| Architecture | Optimization             |
|--------------|--------------------------|
| CPU          | SIMD speculative kernels |
| GPU          | CUDA speculative kernels |
| NPU          | Neural processing units  |
| TPU          | Tensor processing units  |


Practical Recommendations

For Developers

  1. Start simple: Implement speculative decoding first

  2. Profile first: Measure bottlenecks before optimization

  3. Combine wisely: Not all techniques apply to all scenarios

  4. Monitor quality: Speedup shouldn't degrade output quality

For Researchers

  1. Focus on adaptive approaches: Dynamic speculation length

  2. Investigate single-model methods: Eliminate draft model requirement

  3. Explore hardware-specific optimizations: GPU-specific kernels

  4. Benchmark across hardware: CPU, GPU, edge devices

For Production Systems

  1. Use llama.cpp: Native speculative decoding support

  2. Combine with quantization: For edge deployment

  3. Monitor latency: Ensure SLAs are met

  4. Scale horizontally: Multiple draft models for higher throughput


Code Examples

llama.cpp Implementation

llama.cpp ships a dedicated speculative-decoding example (flag names vary between versions; check --help for your build):

# Basic usage
./llama-speculative -m models/llama-7b.gguf \
    -md models/draft-1b.gguf \
    --draft 10

# With a larger batch size
./llama-speculative -m models/llama-7b.gguf \
    -md models/draft-1b.gguf \
    --draft 10 \
    -b 8

HuggingFace Implementation

Transformers implements this as assisted generation via the assistant_model argument of generate() (model names are placeholders):

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load models
tokenizer = AutoTokenizer.from_pretrained("target-model")
draft = AutoModelForCausalLM.from_pretrained("draft-model")
target = AutoModelForCausalLM.from_pretrained("target-model")

# Generate with assisted (speculative) decoding
inputs = tokenizer("Write a poem about", return_tensors="pt")
output = target.generate(**inputs, assistant_model=draft, max_new_tokens=100)

# With sampling parameters
output = target.generate(
    **inputs,
    assistant_model=draft,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)

PyTorch Implementation

import torch
from torch import nn

class SpeculativeDecoder(nn.Module):
    """One round of draft-then-verify (sketch; models return HF-style .logits)."""

    def __init__(self, draft_model, target_model, k=10):
        super().__init__()
        self.draft = draft_model
        self.target = target_model
        self.k = k

    @torch.no_grad()
    def forward(self, input_ids):
        # 1. Draft model proposes k tokens, one at a time (cheap passes)
        ids = input_ids
        draft_probs = []
        for _ in range(self.k):
            probs = torch.softmax(self.draft(ids).logits[:, -1], dim=-1)
            tok = torch.multinomial(probs, 1)
            draft_probs.append(probs.gather(-1, tok))
            ids = torch.cat([ids, tok], dim=-1)

        # 2. Target model scores all k proposals in a single forward pass
        logits = self.target(ids).logits
        start = input_ids.shape[1] - 1
        target_probs = torch.softmax(logits[:, start:start + self.k], dim=-1)

        # 3. Accept draft token i with probability min(1, p_target/p_draft);
        #    stop at the first rejection (the full algorithm then resamples there)
        accepted = []
        for i in range(self.k):
            tok = ids[:, input_ids.shape[1] + i]
            p_t = target_probs[:, i].gather(-1, tok.unsqueeze(-1))
            p_d = draft_probs[i]
            if torch.rand(()) < (p_t / p_d).clamp(max=1.0):
                accepted.append(tok)
            else:
                break
        return accepted

Performance Benchmarks

Latency Comparison

| Model | Method      | Tokens/sec | Latency (100 tokens) |
|-------|-------------|------------|----------------------|
| 7B    | Standard    | 15         | 6.7s                 |
| 7B    | Speculative | 32         | 3.1s                 |
| 13B   | Standard    | 12         | 8.3s                 |
| 13B   | Speculative | 25         | 4.0s                 |
| 70B   | Standard    | 8          | 12.5s                |
| 70B   | Speculative | 18         | 5.6s                 |

Throughput Comparison

| System     | Method      | Requests/sec | P99 Latency |
|------------|-------------|--------------|-------------|
| Single GPU | Standard    | 12           | 2.1s        |
| Single GPU | Speculative | 28           | 0.9s        |
| Multi-GPU  | Standard    | 45           | 0.8s        |
| Multi-GPU  | Speculative | 92           | 0.4s        |

Energy Efficiency

Speculative decoding reduces energy consumption by:

  • 40-60% less energy at the same throughput

  • 2-3x less energy at the same latency

  • A corresponding reduction in carbon footprint
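At roughly constant power draw, energy per token scales with latency per token, so these figures follow from the speedups. An illustrative calculation (the 400 W power figure is an assumption, and the constant-power approximation ignores the draft model's own draw):

```python
power_w = 400                    # assumed steady GPU power draw
std_tok_s, spec_tok_s = 15, 32   # 7B tokens/sec, standard vs. speculative
energy_std = power_w / std_tok_s    # joules per token, standard decoding
energy_spec = power_w / spec_tok_s  # joules per token, speculative decoding
print(round(1 - energy_spec / energy_std, 2))  # -> 0.53, i.e. ~53% less energy/token
```

Note that the power term cancels in the ratio: the relative saving depends only on the speedup.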


Common Pitfalls

1. Poor Draft Model Selection

Using a too-small or too-large draft model:

# ❌ Too small - poor predictions
draft = SmallModel("100M")  # Too small

# ❌ Too large - defeats purpose
draft = SmallModel("10B")  # No speedup

# ✅ Just right
draft = SmallModel("1B-3B")  # Optimal range

2. Incorrect Speculation Length

Using too short or too long speculation:

# ❌ Too short - little speedup
decoder = SpeculativeDecoding(draft, target, k=1)  # Close to standard decoding

# ❌ Too long - wasted draft compute
decoder = SpeculativeDecoding(draft, target, k=50)  # Tail tokens mostly get rejected

# ✅ Optimal
decoder = SpeculativeDecoding(draft, target, k=10)  # Good starting point

3. Ignoring Memory Constraints

Draft models add memory overhead:

# ❌ Ignoring memory
target = LargeModel("70B")
draft = SmallModel("21B")  # ~30% extra weights, little speedup headroom

# ✅ Budget for the draft
target = LargeModel("70B")
draft = SmallModel("13B")  # ~19% extra weights

4. Not Monitoring Quality

Speedup shouldn't degrade quality:

# ✅ Monitor quality
def monitor_quality(output, reference=None):
    metrics = {
        "perplexity": calculate_perplexity(output),
        "bleu": calculate_bleu(output, reference) if reference else None,
        "human_eval": run_human_eval(output)
    }
    return metrics

Conclusion

Speculative decoding is a powerful technique for accelerating LLM inference:

Key Takeaways:

  1. 2-4x speedup without quality degradation

  2. No retraining required

  3. Works with any model

  4. Industry adoption across Google Search, AI Overviews, and more

Implementation Priority:

  1. Start with speculative decoding (largest impact)

  2. Add KV cache quantization (easy integration)

  3. Implement Flash Attention (last step)

  4. Consider model distillation (higher effort)

Future Outlook:

  • Hardware-aware speculation

  • Adaptive speculation length

  • Single-model approaches

  • Cross-architecture optimization

The field is rapidly evolving, and speculative decoding is at the forefront of making LLMs more efficient and accessible. Whether you're a developer deploying models, a researcher optimizing inference, or a practitioner building AI applications, understanding and implementing speculative decoding is essential.


References

Primary Sources

  1. Leviathan et al. (2023). Fast Inference from Transformers via Speculative Decoding. arXiv:2211.17192

  2. Google Research. Looking back at speculative decoding. December 2024. https://research.google/blog/looking-back-at-speculative-decoding/

Research Developments

  1. Cai et al. (2024). Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. arXiv:2401.10774

  2. Distributed setup. arXiv:2302.01318

  3. Several draft guesses. arXiv:2310.15141

  4. Model distillation. arXiv:2310.08461

  5. Shared model parts. arXiv:2402.08644

  6. Single model approach. arXiv:2401.10774

  7. Verify all draft tokens. arXiv:2403.10444

Related Topics

  • kv-cache-quantization - Memory optimization

  • flash-attention - Compute optimization

  • inference-optimization - General optimization techniques

  • llama.cpp - Reference implementation