Research Proposal
Background information
Literate programming remains relevant to persistent problems in software engineering. Since the field’s inception, source code has explained what a system does, but the why behind the chosen solution to a particular problem is often lost or never documented. As software projects evolve and new features are added, the accompanying documentation easily diverges from its counterpart, the source code. As software teams change, the problem compounds and unwritten knowledge is lost. Knuth’s original idea [3] of making software development human-centric instead of computer-centric is sound in concept but has eluded wide-scale adoption because of several foundational issues that remain unsolved. Just as the neural network concept from 1943 [33] took 44 years [34] to go from idea to the recognition of handwritten digits, literate programming is now approaching the transition from concept to application.
Despite its foundational strength in adapting the process to the human rather than the machine, traditional literate programming was held back by several key deficiencies. Software development teams work fast, and detailed design documentation is typically not on the critical path to releasing new products or features. Altering the workflow to integrate the original tools, such as tangle and weave, complicates development and is typically associated with adding time to feature releases and product shipments. Early literate programming tooling supported a minimal set of programming languages and was arguably well suited only to smaller projects. The lack of integration into mainstream integrated development environments further narrows the population of groups willing to complicate the CI pipeline. At its core, literate programming organizes code into files that tell stories about the code, which all but obscures the code in its raw file format. This makes it difficult or impossible for a person to edit code in its native file structure. Despite the good intent behind literate programming, proper implementation and tooling have eluded widespread application and success.
A notable branch called semi-literate programming has seen widespread success in the data science and computational notebook space. This format features mostly linear execution of single-page workbooks that mix prose and code to convey the intent behind the implementation. However, this branch is well suited to data exploration but is not sufficient for traditional software projects.
Large language models and agentic coding tools provide an opportunity to revisit the goals of Knuth’s literate programming and their feasibility in this new age of software development. It is now possible to set up instructions for agents to develop software and keep documentation up to date. Additionally, harnesses can be created around teams of agents to create highly coordinated workflows with role specialization. This makes it possible to imagine a development process in which literate programming elements are maintained by agentic harnesses as part of a new software development workflow without extra effort from the developers. In their current state, five distinct pillars shape the effectiveness of agentic workflows: the LLM, prompt engineering, harness engineering, context engineering, and the memory framework. Literate programming is best positioned to contribute to the context and memory pillars rather than to center on a particular model, prompt style, or harness design.
Despite the excitement around agentic programming and the reshaping of software development into a team of AI agents [1,2], long-running issues remain to be solved. First, individual agents are limited by their context window and can stray from the original instructions during long-running goals. Second, agents exhibit an eagerness to report tasks complete before the full instruction set has been implemented. Third, documentation and traceability are not inherent features of what agents produce. These weaknesses make the human a bottleneck in what could be a fast-paced development pipeline, because the code is labor-intensive to review. The software development community is hard at work on several of these issues, including major companies such as Anthropic and OpenAI and, more recently, OpenClaw and its derivatives.
This research investigates Literate-Agentic Programming (LAP) as a lightweight source-code and documentation strategy for making software easier for AI agents to understand, modify, and remember across bounded context windows. LAP does not require the traditional tangle and weave model because harnesses can provide the equivalent capabilities. Instead, LAP uses traceable links among requirements, design decisions, source code, tests, reports, and documentation, and it is treated primarily as a codebase-resident context and memory layer. The role that tangle and weave played, which hindered initial adoption, is filled in this research by decorators added to the source code documentation, similar to the idea of macros but in the opposite sense. The broader concept of Harnessed-Agentic Software Programming (HASP) describes a structured approach in which agent roles, communication paths, tool permissions, workflow stages, and verification procedures are formally defined by the harness. HASP provides the experimental infrastructure for generating controlled codebases.
The research is inspired by rigid software-development practices used in safety-critical domains. In those environments, requirements are expected to trace to design decisions, design decisions are expected to trace to implementation, and implementation is expected to trace to verification and validation activities. Documentation is also required to remain synchronized as the system changes. This proposal does not attempt to reproduce a full safety-critical development process, but it borrows the principle that important software knowledge should remain traceable and auditable. LAP applies that principle to agentic software development by testing whether lightweight literate-programming enforcement can improve accuracy, maintenance efficiency, and documentation synchronization while also aiding human comprehension and code review.
The experimental focus separates initial codebase construction from later agent maintenance. Source projects will be prepared from the same concept and requirements documents using different LAP documentation and decorator strategies. The main experiment will ask single agents to understand and modify paired versions of those codebases: one version retaining the LAP elements and one version where the LAP-generated decorators and documentation have been stripped while preserving the executable source code.
Literature review
This research proposal is framed within four bodies of work: literate programming, agentic software engineering, context engineering and agent memory, and structured software development. Structured software development methods impose a specification, testing, and traceability framework on top of loosely defined software practices. The literature and recent commercial products highlight the increasing capability of agent systems to build and ship full software products with decreasing human development effort. However, these agentic systems struggle with persistence, human-inspectable design knowledge, links from requirements through code to testing, preservation of design decisions, efficient source code, effective tests, and documentation drift over time. The central question is whether literate elements embedded in the codebase can supply the kind of durable context and layered memory that agents otherwise reconstruct expensively from source-code inspection.
The principles of literate programming provide the historical foundation for writing software first for human understanding rather than for an execution-friendly structure. Knuth’s literate programming methodology framed software development as an explanatory narrative in which code was organized not in files based on code context, but in cohesive documents that explained the design and flow of the code. Code chunks were therefore decorated with identifiers that allowed them to be reassembled into a machine-ready format [3]. This is an attractive idea, but it was plagued by several drawbacks that have prevented widespread acceptance of the methodology to this day [27]. Mainly, the literate programming workflow disrupted traditional software development practices and lacked integrated development environment plugins. Additionally, it required extra tooling for converting formats, and it was originally compatible with only a subset of languages. Later works introduced semi-literate systems consisting of computational notebooks and multi-language research tools that gained popularity within data science and related fields [28]. Despite the computational notebook advances that combined prose with code, lingering problems of traditional software development such as documentation drift, loss of design ontology, and missing traceability from requirements through testing still plague the practice. The relevant unsolved problem for this research proposal is not whether prose and code can be combined, but whether structured literate decorators and documentation improve an AI agent’s ability to form the right context, preserve design rationale, and modify software without excessive rediscovery. In addition, the documentation and decorators are intended to promote human understanding and decrease cognitive load by enabling real-time, browser-integrated lensing that presents code as a story instead of as a file structure.
Recent advances in LLMs and agentic programming capabilities offer a novel solution to the original adoption barriers, allowing developers to reap the benefits of Knuth’s original literate programming method. Autonomous and semi-autonomous agents have shown success in planning, code generation, testing, debugging, tool invocation, and context management [12,18]. A critical distinction in the literature separates “vibe coding,” an informal, exploratory interaction with LLMs, from “agentic coding,” which emphasizes structured workflows, explicit goals, and systematic validation [5]. This proposal narrows that distinction further by separating the agent model, prompting, harness, context, and memory variables so that the experimental treatment can focus on codebase-resident context rather than on the full agent workflow.
Recent research emphasizes the benefits of coordinated agentic teams. HyperAgent evaluates a non-traditional team structure using a Planner, Navigator, Editor, and Executor [6]. AgileCoder applies agile roles to the team structure, creating agents with specialized tasks such as Product Manager, Scrum Master, Developer, Senior Developer, and Tester. A step further, AutoDev leverages autonomous file editing, build execution, testing, and repository operations [9]. These approaches, recent yet already dated, showed that LLMs become more useful when coupled with filesystems, shells, tools, and structured feedback loops within a team structure. ProjectGen improved upon the team structure by focusing on planning and function allocation: it breaks large tasks into context-sized chunks, builds a skeleton of the code, and then generates source code with iterative refinement to meet the project objectives. A Semantic Software Architecture Tree is used to maintain relationships among requirements, architecture, and code [10]. Cross-team orchestration is another layer that explores how the diversity of agent teams may improve solution quality when confronted with ambiguous or complex tasks [29]. Formal coordination substrates between agents are also emerging. One such approach separates semantic processing from coordination logic using a topic-based Petri-net architecture, showing that some harness responsibilities can be made more structured and less dependent on repeated LLM inference [30].
Both Anthropic and OpenAI reinforce the guidance of using agents where they excel and deterministic code where tasks should be repeatable. Anthropic’s “Building Effective Agents” [21] recommends definable workflows before layering on LLM autonomy. Additional Claude Code guidance emphasizes context management, subagents for parallel tasks and investigation, verification before marking tasks complete, and reusable skills built from deterministic code [23]. Breadth-first work across agent teams improves performance on large research tasks, but at the cost of higher token usage [24]. OpenAI research aligns closely with Anthropic’s guidance on orchestration of agents, tools, and specific agent instructions [25]. These sources all support the concept of engineering agent teams instead of letting them form and communicate organically. Determinism in skills, instructions, communication paths, and workflows aligns with best practices in research and in commercial product guidance.
Beyond constraining the mechanisms by which LLMs communicate, specification-driven design defines a rigid software development process that builds upon the workflow by enforcing a requirement-code-test sequence. Natural language specifications have been shown to produce data structures and code in AIDI [14], including knowledge graphs and an ontology of design decisions. ProjectGen decomposes tasks from requirements in a tree architecture. Konda [13] argues that agents can be used effectively to maintain traceability across requirements engineering, testing, and deployment. Anecdotally, most projects lack traceability from requirements through code and testing to verification and validation. Historically, this traceability has been labor-intensive to implement; the application of LAP/HASP narrows the traceability question to whether requirement and rationale markers embedded in the codebase improve later agent comprehension and modification.
Testing is an integral part of current agentic systems: AutoDev, AgileCoder, and out-of-the-box agentic systems such as Anthropic Claude Code [22] and OpenAI Codex [32] all implement testing. Cloud automation of testing environments with agentic workflows is possible for projects that need scalable cloud resources [31]. Earlier work on multi-agent systems also shows that test-driven and V-model-style methods can be applied to agent systems through staged unit, integration, system, and acceptance testing [13]. In this study, HASP uses testing as a controlled evaluation and logging layer, while the maintenance agent remains a single-agent subject whose success or failure is measured against frozen tests.
However, evaluation methodology is a major gap [17]. Akshathala et al. argue that agentic systems must be evaluated beyond binary test results or simple unit tests. Instead, evaluation must include tool use, memory behavior, environmental interaction, and behavioral deviations across runs. Existing software-agent benchmarks, including SWE-bench-style repair tasks and newer project-level benchmarks, primarily measure patch correctness, executability, or test success. These metrics are useful, but they do not directly measure documentation-code drift, broken traceability links from requirements, preservation of design rationale, or maintainability during feature addition. The current literature lacks an evaluation of whether LAP improves human comprehension, decreases documentation drift, and decreases token usage.
The research gap addressed by this proposal is specific: no reviewed system directly evaluates whether lightweight literate-programming structures embedded in the source code and nearby documentation improve single-agent software maintenance by strengthening context and memory. Existing harness-engineering work generally treats documentation as secondary to source code, while commercial agent practice increasingly treats context quality as a first-order performance constraint. This proposal addresses that gap by comparing paired codebases with LAP elements retained versus stripped during feature-addition and bug-fix tasks, while using HASP only as the controlled experimental runner and audit layer.
Motivation
Software maintenance remains a difficult task because technical knowledge of a project is often scattered across many locations. There may be a requirements or design document, notes and decisions may live in issues, comments in source code files may be minimal or insufficient, and much of the knowledge resided at one time in the memory of the original developer. These factors hinder software maintenance and make changes riskier and harder to verify.
AI-generated software suffers from the same problems at a much faster pace. Agents produce code faster than humans can, so the human becomes the bottleneck in reviewing and detecting issues in agent-generated code. Current research is attempting to reduce the agentic issues that require a human in the loop.
The current failures of agentic software development closely mirror those of humans performing the same tasks, despite the agents’ superhuman coding capabilities. Agents have a limited working memory, called the context window, that they use while performing their work. The limited context window makes long-running tasks difficult, because agents drift from the original task and exhibit an eagerness to finish early and report the task complete. The solution to these issues has recently been framed as Harness Engineering, in which multiple agents are organized and tasked according to specific skills and their context windows. However, harness engineering is only one pillar. Even a well-designed harness still depends on the context available to each agent turn and on whatever memory survives across turns, summaries, structured files, or vector retrieval. In greenfield projects, the documentation can be generated as the agentic system plans the work, but as development progresses, the original documentation easily drifts from the implementation. Brownfield projects are already plagued with documentation drift, and agents rely on analyzing the code to understand how to fix a bug or implement a new feature. It is currently not well understood how to maintain and leverage a sufficient amount of prose within a codebase to improve agent efficiency and reduce churn. This is where literate programming can be reframed as a disciplined way to place useful, inspectable memory inside the codebase itself.
This proposal aims to investigate which degree, structure, placement, and semantic content of literate decorators and documentation improve agentic source-code understanding and modification. The study does not treat comment density as the main variable. Instead, it tests whether meaningful LAP elements change agent behavior and output by giving the agent just enough task-relevant context to find the right files, preserve non-obvious constraints, and produce correct modifications with less rediscovery. LAP is evaluated as a context and memory treatment inside the codebase, while HASP supplies repeatable experiment setup, execution limits, logging, and scoring. This research does not set out to prove dominance of a particular model, agentic SDK, or harness design because all of these will continue to evolve as the LLMs improve.
The specific research questions to be answered are:
- What kinds of literate decorators and documentation provide useful context for an AI coding agent? Usefulness is measured by the extent to which an element carries design intent, constraints, rationale, invariants, requirement links, or testing expectations that cannot be reliably inferred from local source code alone.
- When given an existing codebase, do agents modify LAP-documented source code more successfully than source code where the LAP-generated decorators and documentation have been removed?
- Which LAP structures provide the strongest effect: requirement decorators, module-level design notes, rationale comments, traceability links, change-history notes, or layered memory artifacts such as structured files and vector-retrievable summaries?
- Which placement strategy changes agent behavior and output most effectively: folder-level context files, source-file headers, function-level decorators, inline rationale comments, or lightweight relationship maps?
- Can compact agent-oriented context files at each folder level provide better navigation and task performance than human-oriented README files, while using fewer tokens?
- Can lightweight LAP relationship artifacts establish important links between distant files in a way that approximates the useful parts of a source-code graph without requiring a full graph database?
- Does LAP reduce token usage, context-compression events, repeated file reads, search churn, develop-test-refine iterations, and agent turns during feature addition and bug fixing?
- Is stripping LAP-generated decorators and documentation from a LAP-developed codebase a valid test for a codebase developed without LAP, or does the original LAP-guided generation leave architectural and naming traces that must be treated as a separate validity condition?
The research hypotheses are as follows:
- Agents working on LAP-documented codebases will complete feature-addition and bug-fix tasks with higher frozen-test pass rates than agents working on paired LAP-stripped codebases.
- Agents working on LAP-documented codebases will require fewer total tokens, fewer context-compression events, fewer repeated search/read cycles, and fewer develop-test-refine loops than agents working on paired LAP-stripped codebases.
- LAP elements that encode design rationale, invariants, requirement links, and test intent will outperform shallow comments that merely restate local code behavior.
- Folder-level agent context files and minimal source-file decorations will improve file-discovery behavior more than increasing comment density.
- Lightweight relationship artifacts that identify non-obvious links between distant files will reduce search churn and off-target edits during feature addition.
- The development process is an important factor. Using LAP during project development may improve the design and coherence of the project, independent of the later benefits of context and memory for future agent work.
Agentic-harness engineering research is still an immature and rapidly changing field. Much of the current work appears in preprints, open-source repositories, and commercial documentation rather than in mature long-term empirical studies. For that reason, this proposal emphasizes controlled comparison, transparent variables, and reproducibility. Model versions, prompts, harness designs, tool permissions, logs, task definitions, containers, and analysis scripts will be documented and published with appropriate open-source licensing. This allows the study to focus on the observed effect of LAP as codebase context and memory rather than on temporary differences among model releases or commercial tools.
Proposed method
The experimental method will use HASP as a controlled research tool to generate the projects with LAP elements, not to assist the agent during the feature-addition work. There are two phases. First, the projects will be generated from the concept and requirements documents using multiple LAP documentation/decorator strategies. Second, agents will perform feature-addition and bug-fix tasks against the project codebase variants, where the source code is held constant and the LAP context layer is either present or stripped.
In summary, there are three primary experimental conditions:
- Projects developed with LAP, which include literate decorators, design notes, ontology, requirements links, test links, and structured memory files.
- Projects with the LAP content removed from the source code and from any other LAP-created files.
- A control project developed with the same HASP design without LAP instructions. This tests whether a project stripped of LAP content is a fair comparison for a project developed without LAP.
The same model versions, model settings, harness design, tools, scripts, task definitions, frozen tests, execution limits, and features are held constant within each paired comparison. The primary independent variable is the structure and presence of LAP context and memory artifacts in the codebase. A secondary topic, researched as a basis for this proposal, is the development of a properly structured, minimal specification document format optimized for both human and agent use.
LAP Requirements Definition
In LAP-enabled project experiments, the generation process will be configured to apply predefined LAP documentation profiles. Profiles range in complexity from minimal requirement decorators, through module-level design summaries, rationale and invariant comments, requirement-to-test links, and change-history notes, to structured memory files that can later be indexed or embedded in a vector database.
A useful LAP element is any documentation that carries information a later agent or human reviewer cannot quickly or reliably infer from the local code alone. Useful LAP elements include purpose, design rationale, alternatives, domain constraints, requirement identifiers, API contracts, data-model requirements, testing intent, and links between files.
The LAP profiles will vary semantic content and placement rather than simply increasing the amount of documentation. Candidate LAP profile options include: folder-level agent context files that summarize the purpose of code in a directory; source-file headers that state purpose, ownership, requirements, important dependencies, and invariants; function-level decorators that identify requirement links, side effects, contracts, and relevant tests; and inline rationale markers used only where a local code fragment depends on non-local design knowledge.
The folder-level context file should be framed as context for agents rather than a conventional README. A README may be too large, prose-heavy, and token-expensive for this purpose. A compact file such as AGENT_CONTEXT.md or a small structured YAML/JSON equivalent may be more appropriate. Each context capsule should provide just enough information to route the agent: directory purpose, public entry points, files usually edited for common task types, important tests, non-obvious dependencies, local invariants, and cautions about files that should not be changed casually.
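As a rough illustration of how small such a context capsule can be, the sketch below renders one as a compact text file. The class name `ContextCapsule`, its field names, and the example paths are all hypothetical; the proposal does not fix a schema, so this is one possible layout rather than the LAP format.

```python
from dataclasses import dataclass, field


@dataclass
class ContextCapsule:
    """Hypothetical folder-level capsule; every field name is an assumption."""
    purpose: str
    entry_points: list[str] = field(default_factory=list)
    common_edits: dict[str, list[str]] = field(default_factory=dict)
    key_tests: list[str] = field(default_factory=list)
    invariants: list[str] = field(default_factory=list)
    do_not_edit: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render a compact text body suitable for AGENT_CONTEXT.md."""
        lines = ["# Agent context", f"purpose: {self.purpose}"]
        sections = {
            "entry_points": self.entry_points,
            "key_tests": self.key_tests,
            "invariants": self.invariants,
            "do_not_edit": self.do_not_edit,
        }
        for name, items in sections.items():
            if items:  # empty sections are omitted to save tokens
                lines.append(f"{name}:")
                lines.extend(f"- {item}" for item in items)
        if self.common_edits:
            lines.append("common_edits:")
            for task, files in self.common_edits.items():
                lines.append(f"- {task}: {', '.join(files)}")
        return "\n".join(lines) + "\n"


capsule = ContextCapsule(
    purpose="Command-line parsing and dispatch",
    entry_points=["cli.py:main"],
    common_edits={"add a subcommand": ["cli.py", "commands/__init__.py"]},
    key_tests=["tests/test_cli.py"],
    invariants=["Exit code 2 is reserved for usage errors"],
)
text = capsule.render()
print(text)
```

Omitting empty sections keeps the capsule short, which matters if the file is read on every visit to the directory.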
The source-file decoration strategy will test whether useful context belongs at the top of each source file, at each function, inside the code near non-obvious decisions, or in a hybrid of these locations. File-level context is expected to help with triage and navigation. Function-level decorators are expected to help with local modification and test targeting. Inline comments are expected to be valuable only when they explain why the implementation is constrained by distant code, domain rules, or prior design decisions.
LAP may also include lightweight relationship artifacts that approximate the useful portion of a source-code graph. Instead of requiring a full graph database in the initial study, each folder or module can maintain a small relationship map such as .lap/links.json, .lap/links.md, or a decorator block that records non-obvious links among requirements, modules, functions, tests, data models, and external contracts. The goal is to help an agent discover distant coupling that static file search may not reveal quickly.
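A relationship map of this kind can be queried without any graph database. The sketch below assumes a hypothetical schema for .lap/links.json (a flat list of `source`, `target`, `kind` records); the identifiers such as `REQ-012` and the file paths are invented for illustration only.

```python
import json

# Hypothetical contents of a .lap/links.json file; the schema is an assumption.
LINKS_JSON = """
{
  "links": [
    {"source": "api/orders.py:create_order", "target": "REQ-012", "kind": "implements"},
    {"source": "api/orders.py:create_order", "target": "storage/schema.py:Order", "kind": "depends_on"},
    {"source": "tests/test_orders.py::test_create_order", "target": "REQ-012", "kind": "verifies"}
  ]
}
"""


def load_links(text: str) -> list[dict]:
    """Parse the relationship map and return its list of link records."""
    return json.loads(text)["links"]


def related(links: list[dict], node: str) -> list[tuple[str, str]]:
    """Return (kind, other_node) pairs for links touching `node` in either direction."""
    found = set()
    for link in links:
        if link["source"] == node:
            found.add((link["kind"], link["target"]))
        elif link["target"] == node:
            found.add((link["kind"], link["source"]))
    return sorted(found)


links = load_links(LINKS_JSON)
for kind, other in related(links, "REQ-012"):
    print(kind, other)
```

A lookup on a requirement identifier returns both the implementing function and the verifying test, which is the kind of distant coupling a plain text search may surface only after several attempts.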
LAP-enabled codebases will use literate programming decorators to support later lensing and graph-based organization of the code. These decorators are not intended to recreate traditional tangle and weave workflows. Instead, they provide structured markers that help agents and analysis tools connect implementation elements to requirements, design rationale, tests, and memory artifacts.
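One possible reading of such a marker in Python is a decorator that attaches metadata to a function without altering its behavior. The sketch below is an assumption about form, not the LAP design: the decorator name `lap`, its parameters, the `LAP_REGISTRY` list, and the requirement and test identifiers are all hypothetical.

```python
# Hypothetical registry that an analysis tool could read after import.
LAP_REGISTRY: list[dict] = []


def lap(*, requirements=(), tests=(), rationale=""):
    """Attach LAP metadata to a function and return the function unchanged."""
    def wrap(func):
        meta = {
            "name": func.__qualname__,
            "requirements": list(requirements),
            "tests": list(tests),
            "rationale": rationale,
        }
        func.__lap__ = meta          # metadata travels with the function
        LAP_REGISTRY.append(meta)    # and is also collected in one place
        return func                  # no wrapper, so runtime behavior is identical
    return wrap


@lap(
    requirements=["REQ-007"],
    tests=["tests/test_pricing.py::test_discount_rounds_down"],
    rationale="Discounts round down so a customer is never overcharged.",
)
def apply_discount(total_cents: int, percent: int) -> int:
    return total_cents - (total_cents * percent) // 100


print(apply_discount(1000, 15))
print(apply_discount.__lap__["requirements"])
```

Because the decorator returns the original function, removing it later changes only the metadata and not the executable behavior.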
In LAP-stripped experiments, the source code generated under a LAP profile will be transformed by removing the LAP decorators and LAP documentation layer. This removes the LAP artifacts while preserving the implementation produced during the original development process. Because this does not prove equivalence to a codebase developed without LAP, an organically non-LAP control baseline will be used as well.
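The stripping step can be sketched for the simplest case, where LAP markers are whole-line comments. The sketch assumes a hypothetical `#@lap` comment prefix and handles only Python sources; decorator-style markers, separate documentation files, and other languages would need their own handling. It also checks that the parsed program is unchanged, which is a narrow structural check rather than a full behavioral guarantee.

```python
import ast
import io
import tokenize

MARKER = "#@lap"  # hypothetical prefix that identifies LAP comments


def strip_lap_comments(source: str) -> str:
    """Remove whole-line comments that start with MARKER; leave all else intact."""
    drop = set()
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT and tok.string.startswith(MARKER):
            # Drop the line only when the comment is the first thing on it.
            if tok.line[: tok.start[1]].strip() == "":
                drop.add(tok.start[0])
    kept = [
        line
        for number, line in enumerate(source.splitlines(keepends=True), start=1)
        if number not in drop
    ]
    return "".join(kept)


def same_program(a: str, b: str) -> bool:
    """Compare syntax trees; comments never appear in the AST."""
    return ast.dump(ast.parse(a)) == ast.dump(ast.parse(b))


SOURCE = '''#@lap req=REQ-003 rationale="ids must stay stable across exports"
def next_id(last):
    #@lap invariant: ids are strictly increasing
    tag = "#@lap not a comment"
    return last + 1  # ordinary comment stays
'''

stripped = strip_lap_comments(SOURCE)
print(stripped)
print(same_program(SOURCE, stripped))
```

Using the tokenizer rather than a text search avoids deleting marker-like text inside string literals, and ordinary comments survive so that the stripped variant differs from the retained variant only in the LAP layer.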
A key risk is documentation staleness. If an agent modifies LAP-retained code without recognizing that the corresponding context capsule, decorator, or relationship map must also be updated, the LAP layer can quickly become misleading. Therefore, LAP-retained codebases should include minimal in-repository maintenance instructions that tell the agent which local LAP artifacts must be checked after code changes. Documentation drift and stale relationship links will be measured as part of maintenance quality rather than assumed away.
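One simple way to flag possibly stale LAP artifacts is to record a content hash of each described source file when the artifact is written and compare it later. The sketch below is an illustrative assumption, not a measurement procedure defined by the proposal; the function names and example files are hypothetical. A changed hash means only that the artifact should be re-checked, not that it is wrong.

```python
import hashlib


def digest(text: str) -> str:
    """Return the SHA-256 hex digest of a source text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def stale_artifacts(recorded: dict[str, str], current: dict[str, str]) -> list[str]:
    """List paths whose source changed or vanished since the LAP artifact was written.

    recorded: path -> digest stored when the LAP artifact was last updated
    current:  path -> current source text
    """
    stale = []
    for path, old_digest in recorded.items():
        text = current.get(path)
        if text is None or digest(text) != old_digest:
            stale.append(path)
    return sorted(stale)


recorded = {
    "cli.py": digest("def main():\n    return 0\n"),
    "store.py": digest("DB = 'a.db'\n"),
}
current = {
    "cli.py": "def main():\n    return 1\n",  # edited after the capsule was written
    "store.py": "DB = 'a.db'\n",
}
print(stale_artifacts(recorded, current))
```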
HASP Design
The HASP system will be implemented as a lightweight research harness rather than as a full commercial development platform. Keeping the harness small supports transparency and makes the experiment easier to inspect and reproduce. The implementation will use the Claude Agent SDK because it provides file read, write, and edit tools; shell command execution; code search; subagent invocation; tool-permission controls; and usage tracking. The research harness will add LAP-specific logic on top of those capabilities. HASP is responsible for the project preparation, run isolation, agent invocation, stopping rules, logging, artifact hashing, and final scoring. It will not add a multi-agent advantage during the feature-addition task because that would confound the effect of the LAP context layer.
Each experiment will run inside an isolated project container. The original test-case files and frozen tests will remain unchanged, and each run will create a separate working copy of the project. This prevents cross-test contamination.
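The working-copy and artifact-hashing part of this setup can be sketched with the standard library alone. Container isolation is out of scope for the sketch, and the function names, directory layout, and run identifier are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def tree_digest(root: Path) -> str:
    """Hash the relative path and contents of every file under root."""
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
    return digest.hexdigest()


def make_working_copy(frozen: Path, runs_dir: Path, run_id: str) -> Path:
    """Copy the frozen project into a fresh per-run directory."""
    destination = runs_dir / run_id
    shutil.copytree(frozen, destination)  # raises if the run directory already exists
    return destination


with tempfile.TemporaryDirectory() as tmp:
    frozen = Path(tmp) / "frozen"
    (frozen / "tests").mkdir(parents=True)
    (frozen / "app.py").write_text("def add(a, b):\n    return a + b\n")
    (frozen / "tests" / "test_app.py").write_text("from app import add\n")
    before = tree_digest(frozen)

    work = make_working_copy(frozen, Path(tmp) / "runs", "run-001")
    # Simulated agent edit, applied to the working copy only.
    (work / "app.py").write_text("def add(a, b):\n    return a - b\n")

    frozen_unchanged = tree_digest(frozen) == before
    copy_changed = tree_digest(work) != before
    print(frozen_unchanged, copy_changed)
```

Recording the digest of the frozen tree before and after each run gives a cheap check that no run touched the originals.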
During project generation, HASP may use a controlled multi-agent, scripted workflow to create source projects from the concept and requirements documents. This phase can still use architecture, planning, implementation, testing, review, and documentation roles because initial code generation is not the primary outcome being tested. The generated codebases must pass frozen tests before they enter the maintenance corpus.
During feature-addition and bug-fix evaluation, HASP will invoke a single coding agent per run. The agent will receive the same feature request, the same allowed tools, the same execution limits, and the selected codebase variant. HASP will record what the agent reads, edits, searches, tests, summarizes, compresses, and retries. Failed tests may trigger bounded develop-test-refine loops until the run passes required criteria or reaches predefined stopping rules.
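The bounded develop-test-refine loop can be sketched as a small control function with explicit stopping rules. The iteration limit, token budget, and callback signatures below are placeholder assumptions; the real agent call and test runner are replaced by stubs.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RunResult:
    passed: bool
    iterations: int
    tokens: int
    stop_reason: str


def refine_loop(
    attempt: Callable[[int], int],   # performs one agent attempt, returns tokens used
    run_tests: Callable[[], bool],   # runs the frozen tests, returns True on pass
    max_iterations: int = 5,         # placeholder limit
    token_budget: int = 200_000,     # placeholder limit
) -> RunResult:
    """Repeat attempt-then-test until tests pass or a stopping rule fires."""
    tokens = 0
    for iteration in range(1, max_iterations + 1):
        tokens += attempt(iteration)
        if run_tests():
            return RunResult(True, iteration, tokens, "tests_passed")
        if tokens >= token_budget:
            return RunResult(False, iteration, tokens, "token_budget")
    return RunResult(False, max_iterations, tokens, "iteration_limit")


# Stub agent that "fixes" the code on its third attempt.
state = {"fixed": False}


def stub_attempt(iteration: int) -> int:
    if iteration >= 3:
        state["fixed"] = True
    return 1000


result = refine_loop(stub_attempt, lambda: state["fixed"])
print(result)
```

Returning the stop reason alongside the counts lets the logs distinguish a run that passed from one that simply exhausted its limits.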
The implementation will avoid unnecessary complexity such as a general-purpose dashboard, public benchmark comparisons, and open-ended autonomous agent-to-agent chat. The purpose of HASP is to provide a controlled research environment for evaluating LAP as context and memory, not to demonstrate a full production agent platform.
Experiment Preparation Steps
The initial evaluation will use custom research projects rather than a large public benchmark suite. Custom projects keep the preliminary work feasible and allow the study to measure LAP-specific outcomes that public benchmarks do not directly measure. Public benchmarks such as Commit0, FeatureBench, and SWE-bench-style datasets are useful related evaluation approaches, but they are not required for the proposal-stage study because they are not designed to measure documentation-code drift, traceability quality, or preservation of design rationale.
The experiment will use three project types of increasing difficulty, each with defined maintenance-oriented tasks.
- Project A will be a simple command-line application with clear input and output behavior, persistent local storage, unit tests, and a small feature-addition task.
- Project B will be a small API or service application with multiple modules, validation logic, persistence, integration tests, and a bug-fix or API-extension task.
- Project C will be a medium UI plus backend application where frontend behavior, API contracts, backend state, and tests must remain synchronized during a feature addition.
For each project, the preparation workflow consists of:
- Defining LAP profile variants that isolate semantic content and placement, such as folder context only, file headers only, function decorators only, inline rationale only, lightweight relationship maps only, and combined minimal LAP.
- Writing concept documents, requirements, architecture expectations, and frozen tests for each project.
- Generating project files for each LAP configuration.
- Removing LAP-specific information from the projects while preserving executable behavior.
- Generating non-LAP projects from the same requirements, when resources allow, to measure whether LAP-guided development changes architecture, naming, decomposition, or testability even after the documentation layer is removed.
- Preparing maintenance task packages (feature additions and bug fixes) before single-agent evaluation begins.
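The stripping step can be checked mechanically. The sketch below assumes Python source files and a hypothetical `# LAP:` comment marker, neither of which is specified in this proposal, and it handles only inline comments rather than folder context files or relationship maps. It removes marked comments and then confirms that the parsed syntax tree is unchanged.

```python
import ast
import io
import tokenize

LAP_MARKER = "# LAP:"  # hypothetical marker; the proposal does not fix a syntax


def strip_lap_comments(source: str) -> str:
    """Remove comments that start with LAP_MARKER, leaving code untouched."""
    lines = source.splitlines(keepends=True)
    drop = set()
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT and tok.string.startswith(LAP_MARKER):
            row, col = tok.start
            code_part = lines[row - 1][:col].rstrip()
            if code_part:
                lines[row - 1] = code_part + "\n"  # keep the code before the comment
            else:
                drop.add(row - 1)                  # the line held only a LAP comment
    return "".join(line for i, line in enumerate(lines) if i not in drop)


def same_parsed_code(original: str, stripped: str) -> bool:
    """Comments never reach the AST, so equal dumps mean equal parsed code."""
    return ast.dump(ast.parse(original)) == ast.dump(ast.parse(stripped))
```

An equal syntax tree is a necessary check rather than a complete one; the frozen tests remain the behavioral reference for each stripped variant.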
Test Case Artifacts
Each custom project will include a concept document, functional requirements, and acceptance criteria. The concept document will describe the purpose of the software and the user-facing behavior expected from the system. The functional requirements will define the specific capabilities that must be implemented. The acceptance criteria will define the observable conditions that determine whether the task is complete. Each task package will also include a frozen starting codebase variant, a feature request or bug report, frozen regression tests, folder-level context artifacts where applicable, source-file decorators where applicable, relationship maps where applicable, a manifest describing which LAP profile was used, and a record of how LAP elements were stripped for the paired control.
The artifact set should distinguish three knowledge layers: executable source code, human-readable LAP documentation, and machine-readable memory artifacts. Machine-readable memory may begin as structured files such as JSON trace maps, project indices, folder context files, or lightweight relationship graphs, and later be indexed into a vector database. The study should record whether the agent used those memory artifacts directly or whether they merely shaped the local source-code context.
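As one possible shape for a JSON trace map, the sketch below links requirement identifiers to source files and a rationale note. The schema, the identifiers, the file paths, and the helper names are all hypothetical and serve only to illustrate a structured-file memory layer.

```python
import json

# Hypothetical trace map; none of these IDs or paths come from the study.
TRACE_MAP_JSON = """
{
  "requirements": [
    {"id": "REQ-001",
     "files": ["cli/parser.py", "storage/db.py"],
     "rationale": "Example note: input is validated before persistence."},
    {"id": "REQ-002",
     "files": ["storage/db.py"],
     "rationale": "Example note: writes must be atomic."}
  ]
}
"""


def files_for_requirement(trace_map: dict, req_id: str) -> list:
    """Return the source files linked to one requirement."""
    for entry in trace_map["requirements"]:
        if entry["id"] == req_id:
            return list(entry["files"])
    return []


def requirements_for_file(trace_map: dict, path: str) -> list:
    """Return every requirement that references a given source file."""
    return [e["id"] for e in trace_map["requirements"] if path in e["files"]]


trace_map = json.loads(TRACE_MAP_JSON)
```

A flat file like this can be read directly by an agent or a script, and the same records could later be indexed into a vector database without changing the source of truth.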
Execution workflow
Each evaluation run will begin with one of the project variants. A script will copy the selected variant into an isolated workspace, provide the same feature request or bug report to a single agent, apply the HASP configuration for metric collection, and allow the agent to inspect, modify, and test the code under the same limits across conditions.
The evaluation phase will run the frozen tests and compute the final metrics for the study. These metrics will include correctness, token usage, context-compression events, number of turns, repeated file reads, search behavior, failed test cycles, documentation drift, and traceability preservation where applicable. The final artifacts from each run will be retained alongside these metrics. Retaining both successful and failed runs is important because failed runs may reveal qualitative differences between LAP-retained and LAP-stripped workflows that would be hidden if only successful completions were reported.
Metrics
The metric plan follows empirical software-engineering practice by defining measurable outcomes before the study and retaining failed runs rather than reporting only successful completions [26]. The primary metrics are:
- Correctness, defined by frozen-test pass rate and whether all required acceptance tests pass after the feature addition or bug fix.
- Token cost, measured as total input and output tokens used per maintenance task. Token cost is important because LAP may add documentation overhead, but it may reduce rediscovery and repair cycles by giving agents clearer persistent context.
- Context-management cost, measured through context-compression events, repeated file reads, search queries, folder-context accesses, relationship-map accesses, summary writes, memory-file accesses, and turns required before the first source edit.
- Agent behavior change, measured by first file edited, number of files inspected before the correct edit location is found, whether the agent uses the intended LAP artifact, whether it edits unrelated files, and whether it preserves non-obvious constraints referenced by LAP artifacts.
- Maintenance quality, measured by regressions, documentation drift, stale relationship links, broken trace count, and whether the final change preserves the design rationale associated with affected requirements.
Other metrics to be collected are wall-clock time, normalized source-file edits, the number of develop-test-refine cycles, the number of tool or skill calls, comment/decorator update rate, and whether the agent updates affected context capsules or relationship maps after changing code. Normalized source-file edits will be calculated as the number of source-file edits divided by the total number of source files.
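Several of these metrics can be derived directly from a per-run event list. The sketch below assumes each event is a dictionary with hypothetical `turn`, `action`, and `path` fields, with turns numbered from 1; the actual event schema is not defined in this proposal.

```python
from collections import Counter
from typing import Optional


def repeated_file_reads(events: list) -> int:
    """Count reads of a file beyond the first read of that file."""
    reads = Counter(e["path"] for e in events if e["action"] == "read")
    return sum(count - 1 for count in reads.values())


def turns_before_first_edit(events: list) -> Optional[int]:
    """Return how many turns passed before the first edit, or None if no edit."""
    edit_turns = [e["turn"] for e in events if e["action"] == "edit"]
    return min(edit_turns) - 1 if edit_turns else None


def normalized_source_file_edits(events: list, total_source_files: int) -> float:
    """Source-file edits divided by the total number of source files."""
    if total_source_files <= 0:
        raise ValueError("total_source_files must be positive")
    edits = sum(1 for e in events if e["action"] == "edit")
    return edits / total_source_files
```

Normalizing by the number of source files lets edit counts be compared across projects of different sizes.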
Several metrics are intentionally excluded from the study. Human comprehension ratings, subjective maintainability scores, and open-source benchmark leaderboard comparisons will not be included in the initial evaluation.
Logging and reproducibility
Each run will record a machine-readable run manifest whose format is built into the harness and SDK. The manifest will include the project ID or name, date and time, model name and version, harness version, prompt and instruction references, tool and skill configuration, wall-clock limit, container image or environment description, final test report, traceability report where applicable, documentation-drift result where applicable, total tokens, and model settings such as temperature or effort level. The manifest will also identify the LAP documentation profile, whether the run used LAP-retained, LAP-stripped, or non-LAP source, the stripping procedure used, which placement strategy was active, which relationship artifacts were present, and whether structured files or vector retrieval were available as memory layers. This manifest will make it possible to compare runs and audit the experimental conditions.
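A run manifest along these lines might be represented as shown below. Only a subset of the listed fields is included, and the field names, the allowed condition labels, and the JSON layout are assumptions made for this sketch rather than the harness format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional

# Hypothetical labels for the three source conditions described above.
SOURCE_CONDITIONS = {"lap_retained", "lap_stripped", "non_lap"}


@dataclass
class RunManifest:
    project_id: str
    started_at: str                     # ISO 8601 date and time
    model_name: str
    model_version: str
    harness_version: str
    lap_profile: str                    # e.g. "file_headers_only"
    source_condition: str               # one of SOURCE_CONDITIONS
    stripping_procedure: Optional[str]  # None when nothing was stripped
    wall_clock_limit_s: int
    total_tokens: int = 0
    model_settings: dict = field(default_factory=dict)
    relationship_artifacts: list = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject unknown conditions early so runs stay comparable.
        if self.source_condition not in SOURCE_CONDITIONS:
            raise ValueError(f"unknown source condition: {self.source_condition}")

    def to_json(self) -> str:
        """Serialize with sorted keys so manifests diff cleanly between runs."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)
```

Validating the condition label when the manifest is created prevents a mislabeled run from being grouped with the wrong condition during analysis.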
Each run will also record a JSONL event log with one record per agent turn or tool action. Each event record will include the agent role, action type, input and output token counts, test status before and after the action when applicable, artifact hashes before and after the action, and any error or failure messages. The event log will also record context-compression events, summaries written by the agent, memory lookups, folder-context reads, relationship-map reads, source-file decorator reads, files read before editing, repeated reads, search queries, and the point at which the agent first identifies the relevant implementation surface. The purpose of this logging is to make preliminary results auditable and to support later replication. Publishing controlled variables, logs, task definitions, containers, and analysis scripts will also help address the immaturity and rapid change of the agentic software-development field.
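The JSONL event log could be written and read with a few small helpers, sketched below. The event field names used in the usage note are hypothetical, and SHA-256 is an assumed choice for the artifact hashes.

```python
import hashlib
import json
from pathlib import Path


def artifact_hash(path: Path) -> str:
    """Return a SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def append_event(log_path: Path, event: dict) -> None:
    """Append one event as a single JSON line."""
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(event, sort_keys=True) + "\n")


def read_events(log_path: Path) -> list:
    """Load every event from the log, skipping blank lines."""
    with log_path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

A caller might record an edit as `append_event(log, {"turn": 4, "action": "edit", "path": "a.py", "hash_before": h0, "hash_after": h1})`, computing `h0` and `h1` with `artifact_hash` around the action. Appending one line per event keeps the log usable even when a run stops partway through.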
Preliminary results
References
[1] H. Li, H. Zhang, and A. E. Hassan, “The rise of AI teammates in software engineering (SE) 3.0: How autonomous coding agents are reshaping software engineering,” arXiv:2507.15003, 2025. doi: 10.48550/arXiv.2507.15003.
[2] W. Nasir and N. Kallinteris, “From code generation to AI collaboration: The role of multi-agent systems in software engineering,” ResearchGate, Feb. 2025. doi: 10.13140/RG.2.2.21102.32320.
[3] D. E. Knuth, “Literate programming,” The Computer Journal, vol. 27, no. 2, pp. 97-111, 1984. doi: 10.1093/comjnl/27.2.97.
[4] Z. Rasheed, M. A. Sami, M. Waseem, K.-K. Kemell, X. Wang, A. Nguyen, K. Systä, and P. Abrahamsson, “AI-powered code review with LLMs: Early results,” arXiv:2404.18496, 2024. doi: 10.48550/arXiv.2404.18496.
[5] R. Sapkota, K. I. Roumeliotis, and M. Karkee, “Vibe coding vs. agentic coding: Fundamentals and practical implications of agentic AI,” arXiv:2505.19443, 2025. doi: 10.48550/arXiv.2505.19443.
[6] H. N. Phan, T. N. Nguyen, P. X. Nguyen, and N. D. Q. Bui, “HyperAgent: Generalist software engineering agents to solve coding tasks at scale,” arXiv:2409.16299, 2025. doi: 10.48550/arXiv.2409.16299.
[7] M. H. Nguyen, T. P. Chau, P. X. Nguyen, and N. D. Q. Bui, “AgileCoder: Dynamic collaborative agents for software development based on agile methodology,” in 2025 IEEE/ACM Second International Conference on AI Foundation Models and Software Engineering (Forge), 2025, pp. 156-168. doi: 10.1109/Forge66646.2025.00026.
[8] J. P. Paz Grau and A. Castillo Sanz, “A test driven development of MAS,” in Proc. 1st Int. Workshop on Engineering Multi-Agent Systems (EMAS 2013), CEUR Workshop Proceedings, vol. 1113, 2013, pp. 185-200.
[9] M. Tufano, A. Agarwal, J. Jang, R. Zilouchian Moghaddam, and N. Sundaresan, “AutoDev: Automated AI-driven development,” arXiv:2403.08299, 2024. doi: 10.48550/arXiv.2403.08299.
[10] Q. Zhao et al., “Towards realistic project-level code generation via multi-agent collaboration and semantic architecture modeling,” arXiv:2511.03404, 2025. doi: 10.48550/arXiv.2511.03404.
[11] J. He, C. Treude, and D. Lo, “LLM-based multi-agent systems for software engineering: Literature review, vision and the road ahead,” ACM Transactions on Software Engineering and Methodology, 2024. doi: 10.1145/3712003.
[12] H. Wang, J. Gong, H. Zhang, J. Xu, and Z. Wang, “AI agentic programming: A survey of techniques, challenges, and opportunities,” arXiv:2508.11126, 2025. doi: 10.48550/arXiv.2508.11126.
[13] R. Konda, “Agentic AI for software development: Autonomous agents in requirements engineering, testing, and deployment,” International Journal of Emerging Research in Engineering and Technology, 2025. doi: 10.63282/3050-922X.AECTIC-118.
[14] M. A. Matties, “Autonomous intelligent software development,” arXiv:2208.06393, 2022. doi: 10.48550/arXiv.2208.06393.
[15] K. Khemani, “Self-programming AI: Code-learning agents for autonomous refactoring and architectural evolution,” Research Square, 2025. doi: 10.21203/rs.3.rs-6688473/v1.
[16] N. Collier and J. Ozik, “Test-driven agent-based simulation development,” in Proc. 2013 Winter Simulation Conference (WSC), 2013, pp. 1551-1559. doi: 10.1109/WSC.2013.6721538.
[17] S. Akshathala, B. Adnan, M. Ramesh, K. Vaidhyanathan, B. Muhammed, and K. Parthasarathy, “Beyond task completion: An assessment framework for evaluating agentic AI systems,” arXiv:2512.12791, 2025. doi: 10.48550/arXiv.2512.12791.
[18] N. Otoum and N. Elkhalili, “Methods and techniques of agentic software engineering: A systematic literature review,” IEEE Access, vol. 14, pp. 7443-7467, 2026. doi: 10.1109/ACCESS.2026.3652325.
[19] L. Györy, “SLA-driven orchestration of long-running multi-agent enterprise workflows,” Zenodo, 2025. doi: 10.5281/zenodo.17395641.
[20] V. Erol, “Agent-oriented architecture: An analysis on contemporary industrial and academic developments,” Preprints, 2025. doi: 10.20944/preprints202509.2124.v1.
[21] Anthropic, “Building effective agents,” Dec. 19, 2024. [Online]. Available: https://www.anthropic.com/engineering/building-effective-agents
[22] Anthropic, “Best practices for Claude Code,” Anthropic Docs. [Online]. Available: https://code.claude.com/docs/en/best-practices
[23] Anthropic, “Extend Claude with skills,” Anthropic Docs. [Online]. Available: https://code.claude.com/docs/en/skills
[24] Anthropic, “How we built our multi-agent research system,” Jun. 13, 2025. [Online]. Available: https://www.anthropic.com/engineering/multi-agent-research-system
[25] OpenAI, “A practical guide to building agents,” 2025. [Online]. Available: https://openai.com/business/guides-and-resources/a-practical-guide-to-building-ai-agents/
[26] C. Wohlin, P. Runeson, M. Höst, M. C. Ohlsson, B. Regnell, and A. Wesslén, Experimentation in Software Engineering. Berlin, Germany: Springer, 2012. doi: 10.1007/978-3-642-29044-2.
[27] J. Hamer, “Literate programming: A software engineering perspective,” in Proc. Software Education Conf. (SRIG-ET’94), Nov. 1994, pp. 282-288.
[28] R. Gentleman and D. Temple Lang, “Statistical analyses and reproducible research,” Journal of Computational and Graphical Statistics, vol. 16, no. 1, pp. 1-23, 2007.
[29] Z. Du, C. Qian, W. Liu, Z. Xie, Y. Wang, R. Qiu, Y. Dang, W. Chen, C. Yang, Y. Tian, X. Xiong, and L. Han, “Multi-agent collaboration via cross-team orchestration,” in Findings of the Association for Computational Linguistics: ACL 2025, 2025, pp. 10386-10406.
[30] U. M. Borghoff, P. Bottoni, and R. Pareschi, “Beyond prompt chaining: The TB-CSPN architecture for agentic AI,” Future Internet, vol. 17, no. 8, Art. no. 363, 2025. doi: 10.3390/fi17080363.
[31] J. Christadoss, D. Das, and P. Muthusamy, “AI-agent driven test environment setup and teardown for scalable cloud applications,” Journal of Knowledge Learning and Science Technology, vol. 4, no. 3, pp. 1-17, 2025. doi: 10.60087/jklst.v4.n3.001.
[32] OpenAI, “Introducing Codex,” May 16, 2025. [Online]. Available: https://openai.com/index/introducing-codex/
[33] S. Chakraverty, D. M. Sahoo, and N. R. Mahato, “McCulloch–Pitts neural network model,” in Concepts of Soft Computing: Fuzzy and ANN with Programming. Singapore: Springer Singapore, 2019, pp. 167-173.
[34] Y. LeCun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel, “Handwritten digit recognition with a back-propagation network,” in Advances in Neural Information Processing Systems, vol. 2, 1989.