From Snapshots to Histories: Teaching Coding Agents the Archaeology of Codebases
The Alien Code Problem
The software engineering research community has reached a peculiar inflection point. Coding agents built on large language models now solve a substantial fraction of problems on benchmarks like SWE-bench, yet professional maintainers routinely reject their pull requests. This disconnect, analyzed in depth in the paper "Learning to Commit: Generating Organic Pull Requests via Online Repository Memory" by Mo Li et al., reveals a fundamental blind spot in how we evaluate and train autonomous coding systems.
The issue is not functional correctness. The generated code typically compiles, passes unit tests, and satisfies the literal requirements of the issue description. Rather, maintainers reject these contributions because the code feels alien. It reimplements helper functions that already exist in internal utility modules. It ignores naming conventions established over years of development. It violates architectural constraints that are never written down in documentation but are manifest in the repository's evolutionary history. In short, the agents produce code that is syntactically valid but organically dissonant.
Current evaluation paradigms exacerbate this problem. Benchmarks like SWE-bench treat software engineering as a sequence of isolated, atemporal tasks. An agent encounters an issue, edits the codebase, and succeeds if tests pass, with no expectation of accumulated knowledge or stylistic adaptation. The repository is treated as a static corpus of text rather than a living record of decisions, compromises, and conventions accumulated over time.
Online Repository Memory and Supervised Contrastive Reflection
The core innovation of the Learning to Commit framework is the recognition that organicity is learnable from commit history. Rather than exposing an agent only to the latest repository snapshot, which reveals the final state but obscures the rationale behind structural choices, the authors propose training agents on the process by which the codebase reached its current form.
The methodology operates through a strict chronological split. The training phase considers only commits prior to a cutoff date. For each historical commit, the agent performs what the authors term supervised contrastive reflection. It blindly attempts to generate the diff described by the commit message and issue context, then compares its prediction against the actual oracle diff. The gap between the agent's generic output and the maintainer's actual changes is distilled into a continuously growing set of skills. These skills are not merely code snippets but abstract, reusable patterns capturing coding style, preferred internal APIs, and architectural invariants.
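The reflection loop described above can be sketched in a few lines. Everything here is a hypothetical illustration, not the paper's actual API: `Commit`, `blind_attempt`, `distill_skill`, and `reflect` are stand-in names, and the real system would involve an LLM call and far richer skill representations.

```python
# Minimal sketch of supervised contrastive reflection over a commit history.
# All names here are illustrative assumptions, not the paper's implementation.
from dataclasses import dataclass


@dataclass
class Commit:
    date: str          # ISO date, used for the strict chronological cutoff
    message: str       # commit message / issue context shown to the agent
    oracle_diff: str   # the maintainer's actual change (ground truth)


def blind_attempt(message: str) -> str:
    """Stand-in for the agent generating a diff from the message alone."""
    return f"generic patch for: {message}"


def distill_skill(predicted: str, oracle: str) -> str:
    """Stand-in for distilling the gap between the agent's generic output
    and the maintainer's actual change into a reusable pattern."""
    return f"prefer '{oracle}' over '{predicted}'"


def reflect(history: list[Commit], cutoff: str) -> list[str]:
    """Replay commits before the cutoff, accumulating a growing skill set."""
    skills = []
    for commit in sorted(history, key=lambda c: c.date):
        if commit.date >= cutoff:
            break  # strict chronological split: never peek past the cutoff
        predicted = blind_attempt(commit.message)
        if predicted != commit.oracle_diff:
            skills.append(distill_skill(predicted, commit.oracle_diff))
    return skills
```

The key structural property is that the skill set is built exclusively from pre-cutoff commits, so evaluation on post-cutoff pull requests is leakage-free by construction.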
When a genuinely future pull request arrives (drawn from commits after the cutoff), the agent conditions its generation on these accumulated skills. The result is code that respects the project's implicit conventions, reuses internal abstractions rather than reimplementing them, and generally appears as if it emerged organically from the repository's own evolution rather than from generic pretraining priors.
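One simple way to condition generation on accumulated skills is to prepend them to the task context. The function below is a hedged sketch under that assumption; the paper may condition the model quite differently.

```python
# Hypothetical sketch: conditioning generation on accumulated skills by
# injecting them into the prompt. Not the paper's actual mechanism.
def build_skill_conditioned_prompt(issue: str, skills: list[str]) -> str:
    """Prepend skills learned from commit history so the output follows
    project conventions rather than generic pretraining priors."""
    convention_block = "\n".join(f"- {s}" for s in skills)
    return (
        "Follow these repository conventions learned from commit history:\n"
        f"{convention_block}\n\n"
        f"Issue to resolve:\n{issue}\n"
        "Produce a diff that reuses internal abstractions where possible."
    )
```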
This approach differs significantly from standard retrieval-augmented generation systems that treat the codebase as a static vector database. Repository Memory, as formulated here, is dynamic and temporal. It understands not just that a utility function exists, but that the project prefers certain patterns of error handling over others, or that module boundaries have shifted in specific ways over time. The evaluation metrics reflect this focus on organicity: beyond functional correctness, the authors measure code style consistency, internal API reuse rates, and modified-region plausibility.
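To make one of these organicity metrics concrete, internal API reuse could be approximated as the fraction of function calls in a patch's added lines that hit known internal APIs. This is a rough proxy of my own devising; the paper's actual metric definitions may differ substantially.

```python
# Rough proxy for an internal-API-reuse metric, not the paper's definition.
import re


def internal_api_reuse_rate(added_lines: list[str], internal_names: set[str]) -> float:
    """Fraction of call sites in a diff's added lines that reference
    known internal APIs rather than generic or reimplemented helpers."""
    calls: list[str] = []
    for line in added_lines:
        # Capture dotted names immediately followed by an open parenthesis.
        calls.extend(re.findall(r"([A-Za-z_][\w.]*)\(", line))
    if not calls:
        return 0.0
    reused = sum(1 for call in calls if call in internal_names)
    return reused / len(calls)
```

A patch that calls `utils.retry(...)` alongside builtin calls would score between 0 and 1, rewarding contributions that lean on the repository's own abstractions.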
Institutional Knowledge and the Limits of Static Context
What makes this work particularly significant is its alignment with how human developers actually join complex projects. A senior engineer onboarding to a new codebase does not merely read the current state of the files. She performs code archaeology, examining historical commits to understand why certain abstractions were introduced, which patterns are considered idiomatic, and how the maintainers prefer to structure control flow. The Learning to Commit framework operationalizes this archaeological process, treating the commit graph as a curriculum for organizational learning.
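The archaeological process has a direct mechanical analogue: replaying `git log` oldest-first as a curriculum. The sketch below parses output from git's real `--format='%H|%cI|%s'` placeholders (commit hash, strict ISO committer date, subject); the `parse_log` function and the curriculum framing are my own illustrative assumptions.

```python
# Hypothetical sketch: turning `git log --format='%H|%cI|%s'` output into a
# chronological curriculum of commits to replay, oldest first.
def parse_log(log_text: str) -> list[tuple[str, str, str]]:
    """Parse piped git-log lines into (iso_date, sha, subject) tuples,
    sorted oldest-first so commits can be replayed as a curriculum."""
    entries = []
    for line in log_text.strip().splitlines():
        sha, date, subject = line.split("|", 2)
        entries.append((date, sha, subject))
    return sorted(entries)  # strict ISO dates sort chronologically as strings
```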
However, this methodology carries inherent constraints that warrant honest discussion. The approach assumes a rich, high-quality commit history with meaningful commit messages and atomic changes. Poorly maintained repositories with monolithic commits or sparse documentation may not provide sufficient signal for effective skill distillation. Additionally, the computational cost of performing contrastive reflection across thousands of historical commits, while training-free, may be prohibitive for some deployment scenarios.
There is also a deeper epistemological question regarding the generalization of these skills. The paper validates on an expert-maintained internal repository, but the extent to which skills extracted from one module generalize to distant components of the same codebase, or how they handle architectural pivots and breaking changes, remains an open empirical question. The supervised nature of the reflection (comparing against oracle diffs) prevents the accumulation of self-reinforcing errors that plague unsupervised continual learning approaches, but it also limits the method to historical data where ground truth is available.
The Temporal Turn in Agent Design
Beyond software engineering, this work signals a broader shift in how we conceptualize intelligent agents. The field is moving from systems that manage context within a single session toward systems that maintain and update memory across sessions and time. The recognition that understanding why a function changed in March 2023 often matters more than knowing its current implementation represents a fundamental reorientation from state-based to process-based understanding.
In this view, a repository is not a text corpus but an evolutionary record. The same temporal logic applies to legal document analysis, scientific literature review, and medical patient history management. Any domain characterized by sequential artifacts with accumulated institutional knowledge could benefit from similar approaches to supervised contrastive reflection.
The Learning to Commit framework establishes repository-personalized adaptation as a first-class evaluation objective. By measuring organicity alongside correctness, and by enforcing chronological evaluation splits that guarantee zero data leakage, the authors provide a rigorous paradigm for assessing whether agents truly understand a codebase or merely approximate its surface features. As coding agents move from benchmark environments to industrial deployment, this distinction between alien code and organic contributions will likely determine their practical utility far more than pass rates on standardized tests.