Multi-Token Prediction and the Reversal Reasoning Circuit
Multi-Token Prediction (MTP) trains k parallel output heads so that, at each position t, the model predicts tokens t+1 through t+k simultaneously rather than only the next token. This objective induces reversal reasoning in Transformers: the model attends to the goal node first, then traces the path backward through intermediate nodes.
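To make the training objective concrete, here is a minimal sketch in plain Python of how the targets for the k heads could be laid out and how a per-head cross-entropy could be averaged. The function names (mtp_targets, cross_entropy, mtp_loss), the nested-list layout of head_logits, and the equal weighting of all heads are assumptions made for illustration, not details taken from the post.

```python
import math


def mtp_targets(tokens, k):
    """For each position t, return the k future tokens tokens[t+1 : t+1+k]."""
    # Only positions that have a full window of k future tokens are kept.
    return [tokens[t + 1 : t + 1 + k] for t in range(len(tokens) - k)]


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def mtp_loss(head_logits, tokens, k):
    """Average cross-entropy over all usable positions and all k heads.

    head_logits[t][i] is the logit vector produced at position t by head i,
    which is trained to predict tokens[t + 1 + i].
    Assumption: every head is weighted equally.
    """
    targets = mtp_targets(tokens, k)
    if not targets:
        raise ValueError("sequence must be longer than k")
    total, count = 0.0, 0
    for t, future in enumerate(targets):
        for i, target in enumerate(future):
            total += cross_entropy(head_logits[t][i], target)
            count += 1
    return total / count


# Toy usage: 5 tokens, k = 2 heads, vocabulary of 5, all-zero (uniform) logits.
tokens = [0, 1, 2, 3, 4]
uniform = [[[0.0] * 5 for _ in range(2)] for _ in range(3)]
print(mtp_targets(tokens, 2))
print(mtp_loss(uniform, tokens, 2))
```

In a real model the k logit vectors at each position would come from separate output heads sharing one Transformer trunk; the sketch only shows how the targets and the loss line up.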
MantraVid Admin • April 16, 2026