The Amortization of Thought: How Lifelong Learning Agents Learn to Stop Planning
Introduction
There is a peculiar quality to expertise. Watch a master craftsperson at work, or an athlete in flow, and you will notice a conspicuous absence of hesitation. The chess grandmaster does not calculate every branch; the surgeon does not deliberate over each incision. Skill, at its highest level, appears to operate below the threshold of conscious computation. In cognitive science, this phenomenon is often termed automaticity: the transition from controlled, effortful processing to rapid, reflexive execution.
In the context of artificial intelligence, particularly reinforcement learning (RL), this transition has remained more aspirational than formalized. Standard RL agents typically face a binary choice between slow, deliberative planning (model-based methods) and fast, reactive policies (model-free methods). Robotics architectures often institutionalize this divide, explicitly switching between a computationally expensive planner for novel situations and a cached policy for routine execution. A recent theoretical contribution challenges this dichotomy. In Provably Efficient Lifelong Reinforcement Learning with Linear Function Approximation, the authors present an algorithm that blurs this boundary, demonstrating how an agent can accumulate competence across a sequence of tasks until the need for deliberate planning effectively vanishes, while performance remains near optimal.
The Algorithmic Architecture of UCBlvd
The paper introduces UCB Lifelong Value Distillation (UCBlvd), operating within the framework of linear contextual Markov Decision Processes (MDPs). In this lifelong setting, the agent does not merely solve one task; it faces a streaming sequence of tasks, each represented by a linear MDP with shared dynamics but potentially distinct reward functions. The challenge is not just to minimize regret across this sequence, but to do so while limiting the computational overhead of planning.
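To make the setting concrete, here is a minimal sketch of that task structure: every task shares the same feature maps, and a task is identified only by its reward parameter. The class and feature constructions below are hypothetical placeholders (random projections), not the paper's actual construction.

```python
import numpy as np

class LifelongLinearMDP:
    """Sketch of the lifelong linear-MDP setting: all tasks share the
    transition feature map phi, while task k is identified by a reward
    parameter theta_k, with r_k(s, a) = <psi(s, a), theta_k>."""

    def __init__(self, n_states, n_actions, d, d_reward, seed=0):
        rng = np.random.default_rng(seed)
        # Shared across all tasks: transition features of dimension d.
        self.phi = rng.standard_normal((n_states, n_actions, d))
        # Shared reward features of dimension d'; only theta_k varies per task.
        self.psi = rng.standard_normal((n_states, n_actions, d_reward))

    def reward(self, theta_k, s, a):
        # Each task's reward is linear in the shared reward features.
        return float(self.psi[s, a] @ theta_k)

mdp = LifelongLinearMDP(n_states=5, n_actions=3, d=4, d_reward=2)
theta_a, theta_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
# The same state-action pair yields different rewards under different tasks,
# purely through the task parameter theta.
r_a = mdp.reward(theta_a, 2, 1)
r_b = mdp.reward(theta_b, 2, 1)
```

Because the dynamics features are shared, anything the agent learns about transitions under one task transfers for free to every other task in the family; only the reward direction is new.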
The technical results are striking. For $K$ task episodes with horizon $H$, UCBlvd achieves a regret bound of $\tilde{\mathcal{O}}(\sqrt{(d^3+d^\prime d)H^4K})$, where $d$ represents the feature dimension of the transition dynamics and $d^\prime$ the dimension of the reward function. More significantly, the algorithm requires only $\mathcal{O}(dH\log(K))$ planning calls across all $K$ episodes. This is sublinear in the number of tasks; as $K$ grows, the average computational cost per task approaches zero.
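A quick back-of-the-envelope calculation shows how fast the per-episode planning burden shrinks under a bound of this shape. The constant in front of $dH\log(K)$ is not pinned down by the asymptotic notation, so it is set to 1 here purely for illustration:

```python
import math

def planning_calls_upper_bound(d, H, K, c=1.0):
    # c * d * H * log(K): the shape of the paper's planning-call bound,
    # with an arbitrary constant c = 1 for illustration.
    return c * d * H * math.log(K)

d, H = 10, 20
for K in (10**2, 10**4, 10**6):
    per_episode = planning_calls_upper_bound(d, H, K) / K
    print(f"K = {K:>7}: planning calls per episode <= {per_episode:.4f}")
```

Dividing a logarithmic total by a linear number of episodes drives the average toward zero: under these illustrative values, by a million episodes the agent plans fewer than once per hundred tasks on average.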
To appreciate this result, one must understand what constitutes a planning call in this framework. In theoretical RL, a planning call typically refers to solving an optimization problem to compute a value function or policy given a model or value estimate. In standard approaches, each new task might require fresh planning proportional to the horizon or state space. UCBlvd achieves its efficiency through a novel structural assumption that enables computation sharing across tasks during the exploration phase. Rather than treating each task as an isolated optimization problem, the algorithm distills value information into a shared representation that becomes increasingly complete as the agent encounters more reward functions.
The logarithmic dependence on $K$ suggests a process of crystallization. Early in the learning process, when the agent encounters novel task configurations, planning is frequent and necessary. As the agent accumulates experience across the task distribution, the shared structure of the linear MDPs becomes fully characterized. The agent transitions from a mode of active deliberation to one of automatic execution, not through architectural switching, but through the amortization of computation across the task sequence.
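A useful mental model for this crystallization is a standard device from low-switching linear RL: replan only when the feature covariance matrix has accumulated enough new information, for example when its determinant has doubled since the last plan. The simulation below uses that doubling trigger purely as an illustration of how logarithmic planning counts arise; it is an assumed stand-in, not necessarily UCBlvd's exact condition.

```python
import numpy as np

rng = np.random.default_rng(1)
d, episodes = 5, 5000
cov = np.eye(d)                        # regularized feature covariance
det_at_last_plan = np.linalg.det(cov)
plan_count = 0

for _ in range(episodes):
    x = rng.standard_normal(d)         # features observed this episode
    cov += np.outer(x, x)              # rank-one information update
    # Replan only when the determinant has doubled since the last plan:
    # early on this fires often, later almost never.
    if np.linalg.det(cov) > 2 * det_at_last_plan:
        plan_count += 1
        det_at_last_plan = np.linalg.det(cov)

print(f"{plan_count} planning calls over {episodes} episodes")
```

Since the determinant can double only about $d \log_2(K)$ times over $K$ episodes, planning is dense at the start and vanishingly rare later, exactly the qualitative transition described above.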
The Dissolving Boundary Between Planning and Execution
The implications of this logarithmic planning bound extend beyond algorithmic efficiency into the philosophy of agent design. Current robotic systems often rely on explicit hierarchies: a high-level deliberative module (perhaps a motion planner or task planner) handles novel scenarios, while a low-level controller executes cached policies. This architecture reflects a folk theory of intelligence that separates "thinking" from "doing."
UCBlvd suggests a different model, one where expertise emerges not from faster thinking, but from the eventual redundancy of thought. The algorithm maintains sublinear regret while effectively ceasing to plan, suggesting that the distinction between model-based and model-free control may be better understood as a continuum of computational investment rather than a categorical distinction. In this view, a model-based agent that has sufficiently explored its environment becomes indistinguishable from a model-free expert, not because it has memorized specific policies, but because it has internalized the structure of the world to the point where novel situations are immediately recognizable as variations of previously solved problems.
This resonates with findings in neuroscience regarding the shift from hippocampal (deliberative) to cortical (automatic) control in skill acquisition. The mathematical structure of UCBlvd provides a formal analogue to this biological process. The $\mathcal{O}(dH\log(K))$ bound represents a quantitative measure of how much deliberate computation is required to achieve automaticity for a given task distribution. It formalizes the intuition that expertise in a well-defined domain should require negligible cognitive overhead, while remaining adaptable within that domain.
Critical Analysis and Limitations
Despite its elegance, the theoretical framework rests on assumptions that warrant careful scrutiny. The linear MDP assumption, while standard in theoretical RL, represents a significant restriction. Real-world dynamics rarely conform to linear function approximation; they exhibit discontinuities, nonlinear interactions, and high-dimensional state spaces that may not admit low-dimensional feature representations. The dimensions $d$ and $d^\prime$, while hidden in big-O notation, matter profoundly in practice. If the feature space required to represent the dynamics is large, the logarithmic savings in planning calls may be overwhelmed by the constants embedded in the regret bound.
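To see how quickly the dimension dependence bites, one can evaluate the magnitude of the leading regret term for a few values of $d$. The numbers below are purely illustrative, with constants and logarithmic factors dropped:

```python
import math

def leading_regret_term(d, d_prime, H, K):
    # sqrt((d^3 + d'*d) * H^4 * K), with constants and log factors dropped.
    return math.sqrt((d**3 + d_prime * d) * H**4 * K)

K, H, d_prime = 10_000, 20, 10
for d in (10, 100, 1000):
    print(f"d = {d:>4}: leading regret term ~ {leading_regret_term(d, d_prime, H, K):.2e}")
```

The $d^3$ factor means each tenfold increase in feature dimension inflates the bound by roughly a factor of 30, so in high-dimensional feature spaces the regret guarantee can become vacuous long before the logarithmic planning savings matter.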
Furthermore, the analysis assumes a specific form of task relatedness. The tasks must share transition dynamics while varying in rewards, and they must be drawn from a distribution that permits the computation sharing mechanism to function. This is not general transfer learning; it is structured lifelong learning within a well-defined linear family. The algorithm would likely fail to achieve sublinear regret if presented with tasks that violated the shared dynamical structure, or if the task sequence were adversarially constructed outside the linear contextual MDP framework.
There is also a subtle distinction between planning calls as defined algorithmically and the broader notion of computational cost. The bound counts explicit optimization steps, but in a neural implementation, "execution" itself requires computation. The result does not imply that the agent becomes computationally cheaper to run in terms of FLOPs; rather, it eliminates the need for explicit search or optimization at decision time. This is analogous to the difference between retrieving a cached answer and computing it from scratch, though in this case, the "cache" is a sophisticated value function approximation that continues to improve with exposure to new tasks.
Conclusion and Open Questions
The contribution of UCBlvd lies not merely in its regret bounds, but in its conceptual reframing of expertise as the amortization of planning. It provides a formal proof that, within the linear MDP setting, an agent can accumulate experience such that total planning cost grows only logarithmically in the number of tasks, driving the marginal computational cost of solving new tasks toward zero while maintaining performance guarantees.
Several questions remain open. Can these results extend to nonlinear function approximation, where neural networks replace linear features? In such settings, the notion of computation sharing becomes more complex, as representations may need to adapt nonlinearly to new tasks. What happens when the task distribution is non-stationary, when the world itself changes over time? The current analysis assumes a fixed underlying structure; lifelong learning in evolving environments presents additional challenges.
Perhaps most intriguing is the question of whether this framework can inform the design of robotic systems. If the theoretical prediction holds, we should build unified architectures where the boundary between planning and execution dissolves through experience, rather than enforcing it through explicit switches. The vision of an agent that begins by deliberating carefully over each action, and gradually transitions to fluid, automatic expertise, offers a compelling mathematical model for how artificial systems, like biological ones, might truly learn to stop thinking and start knowing.