Shashi Jagtap

Optimizing Databricks Omnigent Agents with MetaHarness

Shashi Jagtap — Sat, 13 Jun 2026 23:25:25 GMT

Databricks has just released Omnigent, a meta-harness for AI agents. The timing is apt: Superagentic AI released its MetaHarness library a few months ago. The two meta-harness concepts sound alike, but they serve different purposes. Agent development is moving up a layer. For the last few years, most agent work has focused on prompts, models, and tool calls. That is still important, but it is no longer the whole system. Real agent workflows now include harnesses, runtime policies, sandboxing, shared sessions, multiple models, skill files, approval gates, and evaluation loops.

That is why Omnigent is an important release. Omnigent is a meta-harness for AI agents. It gives developers a common layer over existing agents such as Claude Code, Codex, Pi, and custom agents. Instead of treating every agent harness as a separate silo, Omnigent provides a shared runtime layer for composition, control, and collaboration.

What Is Omnigent?

Databricks introduced Omnigent as an open source meta-harness for combining, controlling, and sharing agents. The basic idea is that agent harnesses made models usable, and the next layer is the meta-harness: a layer above individual harnesses where teams can manage multiple agents, policies, sessions, and collaboration.

Omnigent focuses on three core ideas.

Composition: use one common layer across multiple harnesses and models. You can switch between Codex, Claude Code, Pi, or custom YAML agents without rewriting the whole workflow.
Control: enforce policies, approvals, cost limits, filesystem access, network access, and sandboxing outside the prompt. This matters because prompt-only safety is not enough for serious agent workflows.
Collaboration: share live agent sessions through terminal, web, desktop, mobile, or APIs so teams can inspect, comment, and steer work together.

This is a different layer than a single coding agent. Omnigent is not just another assistant. It is infrastructure around agents. It wraps agent execution, gives it a common interface, adds policy and sandbox controls, and makes sessions shareable.

That matters because teams rarely use one model in one harness forever. They compare models. They combine agents and run subagents. They need approvals, logs, and durable sessions. They need to control what agents can read, write, and spend. Omnigent is designed for that world.

The Missing Piece: Optimization

Once agents become file-backed and declarative, a new question appears: how do you know the agent definition is good? This is where optimization fits in. In the launch blog post, Databricks mentioned that optimization is on the roadmap, including automatic optimization at the meta-harness level with GEPA and code-based introspection within agents similar to MemEx and RLM, but it has not shipped yet.

If your Omnigent agent is defined by files such as config.yaml, AGENTS.md, skills, policies, and sandbox declarations, those files become part of the system. They influence how the agent behaves just as much as a prompt does.

That creates practical questions:

Is the agent instruction file specific enough?
Does the sandbox allow the right files and block the wrong ones?
Are the policies aligned with the task?
Are skills useful, or are they vague placeholders?
Can one agent configuration be measured against another?
Can we improve the agent definition with evidence instead of intuition?

This is where MetaHarness fits. Superagentic AI built the MetaHarness library a few months ago to optimize coding agents, and it is a natural fit for optimizing Omnigent agents as well.

What Is MetaHarness?

MetaHarness is an open source Python library for optimizing executable harnesses around agentic coding systems. It is inspired by the Meta Harness research direction, but it is built as a practical filesystem-first tool for developers. MetaHarness does not only optimize prompts. It optimizes the files and workflows around agents: instruction files, config files, setup scripts, validation scripts, test flows, skills, and policy definitions.

The loop is straightforward:

Start from a baseline workspace.
Create a candidate workspace.
Ask a proposer backend to improve the candidate.
Validate and evaluate the result.
Store diffs, events, scores, ledgers, and artifacts.
Keep the best candidate.

This makes agent improvement measurable. Instead of manually editing an agent config and hoping it is better, MetaHarness turns the harness into an experiment target.
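The loop above can be sketched in a few lines of Python. This is only an illustration of the idea, not the MetaHarness implementation: the `optimize` function, the `Proposer` and `Evaluator` callables, and the choice to copy each candidate from the current best workspace are all assumptions made for the example.

```python
import shutil
from pathlib import Path
from typing import Callable

# Hypothetical signatures: a proposer edits a workspace in place,
# an evaluator returns a score where higher is better.
Proposer = Callable[[Path], None]
Evaluator = Callable[[Path], float]


def optimize(baseline: Path, propose: Proposer, evaluate: Evaluator,
             budget: int, out_dir: Path) -> tuple[Path, float, list[dict]]:
    """Run `budget` propose/evaluate rounds and keep the best workspace."""
    best_dir, best_score = baseline, evaluate(baseline)
    ledger = [{"candidate": "baseline", "score": best_score}]
    for i in range(1, budget + 1):
        candidate = out_dir / f"c{i:04d}"
        # Assumption: each candidate starts as a copy of the current best.
        shutil.copytree(best_dir, candidate)
        propose(candidate)
        score = evaluate(candidate)
        ledger.append({"candidate": candidate.name, "score": score})
        if score > best_score:
            best_dir, best_score = candidate, score
    return best_dir, best_score, ledger
```

The ledger here is just a list of dictionaries. The real tool also stores diffs, events, and artifacts, which this sketch leaves out.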

MetaHarness 0.4.0 Adds Omnigent Support

We have released MetaHarness 0.4.0, now also available on PyPI. This release adds an experimental Omnigent backend. That means MetaHarness can now use Omnigent as a proposer backend. In practical terms, Omnigent can propose changes to a candidate workspace, while MetaHarness keeps the surrounding optimization loop: candidate workspaces, validation, scoring, diffs, ledgers, frontier selection, and result artifacts.

uv tool install superagentic-metaharness

Or, inside a Python project:

uv add superagentic-metaharness

How The Integration Works

The new backend is called OmnigentCliBackend. It plugs into the same proposer protocol used by the rest of MetaHarness:

prepare(request) -> invoke(request) -> collect(execution)
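To make the three-step protocol concrete, here is a minimal sketch using `typing.Protocol`. Only the method names `prepare`, `invoke`, and `collect` come from the post. The `ProposalRequest`, `Execution`, and `ProposalResult` dataclasses, the `EchoBackend` toy class, and the `run_proposal` helper are hypothetical shapes invented for illustration.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class ProposalRequest:  # hypothetical shape
    workspace: str
    instructions: str


@dataclass
class Execution:  # hypothetical shape
    returncode: int
    stdout: str = ""


@dataclass
class ProposalResult:  # hypothetical shape
    ok: bool
    summary: str


class ProposerBackend(Protocol):
    def prepare(self, request: ProposalRequest) -> None: ...
    def invoke(self, request: ProposalRequest) -> Execution: ...
    def collect(self, execution: Execution) -> ProposalResult: ...


class EchoBackend:
    """Toy backend that does no real work; it only records the call order."""

    def __init__(self) -> None:
        self.calls: list[str] = []

    def prepare(self, request: ProposalRequest) -> None:
        self.calls.append("prepare")

    def invoke(self, request: ProposalRequest) -> Execution:
        self.calls.append("invoke")
        return Execution(returncode=0, stdout=f"edited {request.workspace}")

    def collect(self, execution: Execution) -> ProposalResult:
        self.calls.append("collect")
        return ProposalResult(ok=execution.returncode == 0, summary=execution.stdout)


def run_proposal(backend: ProposerBackend, request: ProposalRequest) -> ProposalResult:
    """Drive one proposal through prepare -> invoke -> collect."""
    backend.prepare(request)
    execution = backend.invoke(request)
    return backend.collect(execution)
```

Because the protocol is structural, any class with these three methods can be passed to `run_proposal` without inheriting from a base class.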

For each candidate, MetaHarness can generate an Omnigent agent bundle under the candidate workspace:

.metaharness/omnigent_agent/config.yaml

It also archives the generated config for review:

proposal/omnigent_agent.yaml

A simplified generated config looks like this:

spec_version: 1
name: metaharness_candidate_proposer
instructions: .metaharness/AGENTS.md

executor:
  type: omnigent
  config:
    harness: codex

os_env:
  type: caller_process
  cwd: /absolute/path/to/candidate/workspace
  sandbox:
    type: darwin_seatbelt
    allow_network: true
    write_paths:
      - AGENTS.md
      - config.yaml
      - skills

One important detail is the working directory. The generated Omnigent config pins os_env.cwd to the absolute candidate workspace. That keeps the agent bundle separate from the workspace being optimized.
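A small sketch of that path handling, assuming the bundle location shown above. The helper name `bundle_and_cwd` is hypothetical. It resolves the candidate workspace to an absolute path for `os_env.cwd` and derives the bundle config path underneath it, so the two are never confused.

```python
from pathlib import Path

# Bundle location taken from the post; the helper itself is illustrative.
BUNDLE_SUBDIR = Path(".metaharness") / "omnigent_agent"


def bundle_and_cwd(candidate_workspace: str) -> tuple[Path, Path]:
    """Return (bundle config path, absolute cwd) for a candidate workspace."""
    # resolve() makes the path absolute even if the caller passed a relative one.
    workspace = Path(candidate_workspace).expanduser().resolve()
    bundle_config = workspace / BUNDLE_SUBDIR / "config.yaml"
    return bundle_config, workspace
```

The returned `cwd` is the workspace root, while the config lives two directories below it, which matches the separation described above.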

Policy And Sandbox Mapping

MetaHarness already supports allowed write scopes. With the Omnigent backend, those scopes can be translated into Omnigent sandbox configuration. For example, a MetaHarness project can define the files an agent is allowed to edit:

{
  "allowed_write_paths": [
    "config.yaml",
    "AGENTS.md",
    "skills"
  ]
}

The Omnigent backend maps those paths into the generated sandbox config and adds a sandbox enforcement policy:

policies:
  metaharness_enforce_sandbox:
    type: function
    handler: omnigent.policies.builtins.safety.enforce_sandbox
    factory_params:
      sandbox_type: darwin_seatbelt
      allow_network: true
      write_paths:
        - config.yaml
        - AGENTS.md
        - skills

This is a useful combination. Omnigent can enforce safety earlier during the agent run, while MetaHarness still validates the final candidate after the run finishes.
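The mapping can be pictured as a plain dictionary transformation. The sketch below is an assumption about the shape of that translation, not the backend's actual code: the function name `sandbox_policy` is hypothetical, while the keys and the handler path are copied from the generated config shown above.

```python
import json


def sandbox_policy(project_json: str, sandbox_type: str = "darwin_seatbelt",
                   allow_network: bool = True) -> dict:
    """Translate allowed_write_paths into a policies mapping."""
    project = json.loads(project_json)
    write_paths = list(project.get("allowed_write_paths", []))
    return {
        "policies": {
            "metaharness_enforce_sandbox": {
                "type": "function",
                "handler": "omnigent.policies.builtins.safety.enforce_sandbox",
                "factory_params": {
                    "sandbox_type": sandbox_type,
                    "allow_network": allow_network,
                    "write_paths": write_paths,
                },
            }
        }
    }
```

Serializing the returned dictionary to YAML would give a block with the same structure as the policy example above.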

Using Omnigent As A MetaHarness Backend

A project can configure the Omnigent backend in metaharness.json:

{
  "backends": {
    "omnigent": {
      "omnigent_binary": "omni",
      "harness": "codex",
      "sandbox_type": "darwin_seatbelt",
      "allow_network": true,
      "no_session": false,
      "proposal_timeout_seconds": 300
    }
  }
}

Then run:

uv run metaharness run \
  examples/omnigent_agent_benchmark \
  --backend omnigent \
  --budget 1

This uses Omnigent to generate a candidate improvement, then MetaHarness validates and scores that candidate.

The Bigger Use Case: Optimizing Omnigent Agents

The most interesting long-term use case is not only using Omnigent to run a proposal. It is using MetaHarness to optimize Omnigent agents themselves. Omnigent agents are file-backed and declarative. That makes them a natural optimization target.

MetaHarness can optimize files such as:

config.yaml
AGENTS.md
skills/*/SKILL.md
custom policy config
sandbox declarations
tool declarations
subagent instructions

This creates a clean division of labor:

Omnigent runs and controls agents. MetaHarness measures and improves the files that define those agents.

A Real Smoke Test with Codex Harness

For the 0.4.0 release, we added a new benchmark:

examples/omnigent_agent_benchmark

The baseline agent was intentionally minimal. It had a basic Omnigent config, a sparse instruction file, and a placeholder review skill. The benchmark asked MetaHarness to improve the agent definition.

We then ran a real Omnigent smoke test through the new backend. The result:

best_candidate_id=c0001
best_objective=0.875
scope_violation_paths=[]

The winning candidate changed only the intended files:

AGENTS.md
config.yaml
skills/review/SKILL.md

That is the behavior we wanted. Omnigent generated a useful candidate improvement. MetaHarness validated it, scored it, recorded the diff, and confirmed that the candidate stayed inside the allowed write scope.
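A scope check of this kind is simple to express. The function below is a hedged sketch of how an empty `scope_violation_paths` list could be computed; the name `scope_violations` and the exact matching rule (a path is allowed if it equals an allowed entry or sits beneath one) are assumptions for illustration.

```python
from pathlib import PurePosixPath


def scope_violations(changed_paths: list[str],
                     allowed_write_paths: list[str]) -> list[str]:
    """Return changed paths that fall outside every allowed file or directory."""
    allowed = [PurePosixPath(p) for p in allowed_write_paths]
    violations = []
    for raw in changed_paths:
        path = PurePosixPath(raw)
        # Allowed if the path is an allowed entry or lives under one.
        inside = any(path == root or root in path.parents for root in allowed)
        if not inside:
            violations.append(raw)
    return violations
```

With the allowed paths from the benchmark, the three files changed by the winning candidate all pass, while an edit to an unrelated file such as `README.md` would be reported.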

What Changed In The Candidate?

The improved candidate made the agent more explicit and safer.

config.yaml was updated with a stronger agent identity, an explicit instruction file, sandbox configuration, and policy shape.
AGENTS.md was expanded with practical repository guidance, safe git behavior, and instructions to inspect MetaHarness artifacts.
skills/review/SKILL.md was changed from a placeholder into a concrete review checklist focused on correctness, tests, security, and maintainability.

This is exactly the type of improvement that is hard to reason about from vibes alone. It is much better to score it with a repeatable benchmark.

What We Learned

The first real integration surfaced several useful engineering lessons.

Omnigent expects agent bundles to be structured as directories with config.yaml, not arbitrary one-off YAML files.
The generated os_env.cwd must point to the candidate workspace, not the generated agent bundle directory.
Runtime scratch files from Codex under Omnigent should not pollute candidate diffs, so MetaHarness now cleans private .codex-tmp scratch before computing workspace changes.
In our local setup, codex-native required tmux, so the verified smoke test used Omnigent’s codex harness path.
The integration works best when Omnigent owns runtime execution and MetaHarness owns experiment structure, scoring, and artifact tracking.
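The scratch-file lesson is easy to illustrate. The sketch below compares two workspaces by content hash and skips a scratch directory while doing so. Note that it ignores the scratch directory rather than deleting it, which is a simplification of the cleanup described above; the helper names `snapshot` and `changed_files` are hypothetical.

```python
import hashlib
from pathlib import Path

SCRATCH_DIRS = {".codex-tmp"}  # directory name taken from the post


def snapshot(root: Path) -> dict[str, str]:
    """Map each relative file path to a content hash, skipping scratch dirs."""
    files = {}
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root)
        if not path.is_file() or SCRATCH_DIRS & set(rel.parts):
            continue
        files[rel.as_posix()] = hashlib.sha256(path.read_bytes()).hexdigest()
    return files


def changed_files(baseline: Path, candidate: Path) -> list[str]:
    """List files that were added, removed, or modified in the candidate."""
    before, after = snapshot(baseline), snapshot(candidate)
    return sorted(p for p in before.keys() | after.keys()
                  if before.get(p) != after.get(p))
```

Anything written under `.codex-tmp` never appears in the result, so runtime noise cannot show up as a candidate change.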

Why Optimization Matters

Agent builders are going to need two layers. First, they need a runtime layer. That is what Omnigent provides: a common interface, shared sessions, policy enforcement, sandboxing, and collaboration across agent harnesses. Second, they need an optimization layer. That is what MetaHarness provides: a repeatable loop for improving the files that shape agent behavior.

Together, they form a practical workflow:

Define an Omnigent agent in files.
Run it through MetaHarness.
Let an agent propose improvements.
Validate and score the result.
Keep the better candidate.
Repeat with evidence.

This is useful because serious agent development is not only about choosing the best model. It is about improving the system around the model: instructions, tools, policies, skills, workflows, and validation.

Try It

Install the latest release from PyPI:

uv tool install superagentic-metaharness

Or upgrade an existing install:

uv tool upgrade superagentic-metaharness

Read the MetaHarness 0.4.0 release notes, browse the MetaHarness repository, and try the Omnigent backend with the example benchmark. To learn more about Omnigent itself, start with omnigent.ai and the Databricks launch post.

Closing Thought

Omnigent gives agent builders a portable runtime layer. MetaHarness gives that layer a feedback loop. That combination is the interesting part. Agents can now be composed, governed, shared, measured, and improved as file-backed systems. That is the direction agent engineering is moving, and MetaHarness 0.4.0 is a first step toward making Omnigent agents optimizable with evidence.

Superagentic AI at the Google I/O and CAIS Conference: Reflections from Bay Area

Shashi Jagtap — Sun, 31 May 2026 13:00:13 GMT

Last week, I returned to London from the San Francisco Bay Area with renewed energy, fresh insights, and a deep appreciation for the people driving artificial intelligence forward. Since getting back I have been dealing with jet lag and a bad cough and cold, but I will manage. From May 17 to May 27, 2026, I immersed myself in a series of high-caliber events, including Google I/O 2026, the inaugural ACM Conference on AI and Agentic Systems (CAIS), the AWS Builders Showcase, the Google DeepMind Hackathon and Post I/O event, and several intimate founder gatherings in San Francisco and the Bay Area.

While the technical announcements and research presentations were outstanding, the true value of the trip came from the conversations, connections, and shared vision among builders in cafes, researchers at universities, and attendees at CAIS. The Bay Area remains a singular place where ambitious ideas meet world-class talent, and this week reinforced why it continues to shape the future of technology.

A Practical Beginning: AWS Partner Showcase SF

The trip started the moment I landed on May 18. After collecting my Google I/O badge, I went straight to the AWS Builders Showcase, hosted by AI Camp. This event offered the perfect entry point, focusing on real-world implementation rather than theoretical concepts. Attendees and speakers tackled deployment challenges, evaluation frameworks, observability, and how agentic systems deliver measurable business value. I had the opportunity to meet Jason Lopatecki, CEO of Arize AI, along with professionals from Mistral AI, Coder, and other organizations. The collaborative atmosphere set a strong foundation for the rest of the week. A key highlight was hearing Jason Lopatecki and Jady Liu, who spoke mostly about agent harnesses. I was also glad to catch up with the other speakers, Rahul, Rohan, and Rob, and to hear the insights from their talks. I had fun visiting all the booths, including Arize AI, Coder, and Workato.

Google I/O 2026: Witnessing the Future Unfold

Google I/O 2026 formed the heart of the trip. The keynotes showcased substantial advancements in Gemini models, multimodal capabilities, agentic workflows, and developer tools. It felt less like viewing a distant roadmap and more like experiencing innovation in real time, particularly with progress in areas like Gemini Omni and enhanced agent orchestration with managed agents.

A personal highlight was meeting Demis Hassabis, CEO of Google DeepMind. Having followed DeepMind’s work for years, the chance to speak with him briefly and take a selfie together was genuinely memorable. It humanized the rapid progress we often see only through screens. As a Londoner, I have always been inspired by his vision and his commitment to London. I felt very lucky to get a picture with him. I dedicated significant time to the exhibition pavilions, especially the AI Pavilion and the A2A/ADK and Quantum AI demonstrations. Speaking with Ian and the team showcasing the Gemma models provided practical context that complemented the stage presentations. The evening block party created a relaxed environment for deeper discussions with developers, researchers, and product teams from Google and beyond. These informal moments often yielded the most valuable perspectives.

The Real Value: Conversations and Community

Throughout the week, I attended additional sessions, networking events, and founder meetups across San Francisco. One evening at a Bright Data event, I connected with Hugging Face co-founder Thomas Wolf, Laura Modiano, and several impressive founders. Many other meaningful interactions occurred in cafes, lobbies, and shared spaces, including Arize AI-related gatherings.

These conversations reinforced a vital lesson: innovation thrives through collaboration. Whether challenging assumptions or sharing hard-won lessons from production environments, the exchange of ideas accelerated understanding in ways no single keynote could achieve. The industry is clearly entering a phase where orchestration, evaluation, and system design matter as much as raw model capabilities.

Friday’s AI in Production mini-conference deepened this focus on execution, monitoring, and scaling. The following day at the Google DeepMind Hackathon, the hands-on energy was invigorating. Collaborating with talented engineers and researchers on rapid experimentation and iteration brought the week’s themes to life.

Moments of Reflection and Broader Perspectives

After days of intense activity, Sunday offered a welcome balance. Exploring San Francisco, including a walk across the Golden Gate Bridge, provided space to reflect on the week’s insights amid the city’s iconic scenery. On Monday, visiting a friend at Stanford University offered another valuable lens. Campus conversations about research and startups bridged academic inquiry with practical application, highlighting both challenges and opportunities ahead.

CAIS 2026: Where Research Meets Practice

The trip culminated at the inaugural ACM Conference on AI and Agentic Systems (CAIS) on May 26. Focused on compound AI architectures, optimization, evaluation, and reliable deployment, the event brought together rigorous research with real-world engineering needs. I was fortunate to meet researchers behind influential projects such as DSPy and GEPA, along with contributors from the Laude Institute and other leading institutions. Omar Khattab and Lakshya Agrawal are my idols, and I felt incredibly lucky to meet them in person, although I could not get pictures with them on this trip. I also met some incredible researchers at the Laude Lounge, as well as Yoonho Lee, author of the Meta-Harness paper.

Key Takeaways from the Bay Area

This week crystallized several important themes:

Agentic AI is maturing rapidly and Agent Engineering is evolving. The focus has shifted from experimentation to production readiness, with strong emphasis on reliability and measurable outcomes.
Evaluation and observability are now central. Understanding how systems perform, fail, and improve has become a foundational discipline.
Compound architectures and Harness Engineering represent the next frontier. Success increasingly depends on intelligently orchestrating multiple specialized components rather than relying on a single model.
Human connections remain irreplaceable. Technology advances at an astonishing pace, yet the most meaningful progress emerges from people willing to share knowledge and build together.

Looking Forward

I returned home energized and more optimistic than ever about the trajectory of artificial intelligence. The Bay Area continues to serve as a global nexus for those shaping this future, and participating in this ecosystem was both humbling and inspiring. I am grateful to the organizers, speakers, and everyone who shared their time and insights. If our paths crossed during the week, thank you for the conversation. If you are building in the agentic AI space, I would welcome the opportunity to connect and continue the dialogue.

The future of AI will not be defined by models alone, but by the communities that bring them to life. After this week in the Bay Area, I am more committed than ever to being part of that journey. I look forward to applying these experiences in my work and staying connected with the remarkable people I met. Here’s to building responsibly and collaboratively in the years ahead.

PyFlue 0.2.0: Bringing Flue's Agent Runtime Model to Python

Shashi Jagtap — Sun, 31 May 2026 10:30:24 GMT

PyFlue 0.2.0 is now available. This release is a major step toward parity with the TypeScript Flue framework and introduces a clearer runtime model for building production-oriented Python agents. The main theme of this release is structure. PyFlue now separates long-lived agents from finite workflows, adds persistent agent instances, introduces asynchronous dispatch, improves observability, and makes Pydantic AI the default harness backend.

PyFlue remains focused on Python teams that want framework-level ergonomics for agents while keeping access to the Python ecosystem, Pydantic models, Python deployment targets, and familiar server infrastructure.

Agents and Workflows

The largest change in PyFlue 0.2.0 is the adoption of the Flue agents and workflows architecture. An agent is a persistent, addressable runtime boundary. It is useful when you want one identity to keep working over time, such as a support assistant for a ticket, a coding assistant for a repository, or a chat assistant for a thread. A workflow is a finite unit of work. It runs once, produces a result, and has a workflow run id that can be inspected through run APIs and logs.

  from pyflue import create_agent

  default = create_agent(
      lambda ctx: {
          "model": "openai/gpt-5.5",
          "instructions": f"Help with the request for instance {ctx.id}.",
      }
  )


  from pyflue import FlueContext, create_agent

  agent = create_agent(lambda ctx: {"model": "openai/gpt-5.5"})


  async def run(ctx: FlueContext) -> dict:
      harness = await ctx.init(agent)
      session = await harness.session()
      response = await session.prompt(ctx.payload["text"])
      return {"summary": response.text}

This gives application authors a clearer way to choose the right abstraction. Use agents when identity and continuity matter. Use workflows when the unit of work is bounded and result-oriented.

Persistent Agent Instances

PyFlue agents are now addressable through stable instance ids.

POST /agents/{name}/{id}

The {id} value identifies the continuing agent instance. It can represent a customer, repository, ticket, chat thread, or any other application-defined boundary.

Each instance can maintain separate sessions. This lets applications model long-running interaction without inventing their own session layer around the agent runtime.

Persistent agents can be used over HTTP, Server-Sent Events, and WebSocket connections. The Python client also includes agent helpers for invoking and connecting to deployed agents.

Asynchronous Dispatch

PyFlue 0.2.0 adds dispatch() for asynchronous agent input. This is useful when an application receives an event but should not keep the inbound request open while the model works. Common examples include webhooks, queue messages, issue comments, and chat events.

from pyflue import dispatch
from src.agents.support_assistant import default as support_assistant


async def accept_comment(event):
    return await dispatch(
        support_assistant,
        id=event["customer_id"],
        session=event["case_id"],
        input={
            "type": "support.comment.created",
            "text": event["text"],
        },
    )

dispatch() returns a receipt once the input has been accepted. The agent processes the input through its own session and tools. For production systems, dispatch should be placed behind a durable application queue when accepted work must survive restarts. The current Python dispatch path uses process-memory admission, which is appropriate for local development and simple deployments, but not a replacement for durable message delivery.
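As one way to picture a durable queue in front of dispatch, here is a minimal sketch built on the standard library's `sqlite3` module. It is not part of PyFlue. The `DurableInbox` class and its method names are hypothetical, and a production system would more likely use a real message broker.

```python
import json
import sqlite3


class DurableInbox:
    """Persist accepted events on disk so they survive a process restart."""

    def __init__(self, path: str) -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS inbox ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "payload TEXT NOT NULL, "
            "done INTEGER NOT NULL DEFAULT 0)"
        )
        self.db.commit()

    def accept(self, event: dict) -> int:
        """Store the event and return its row id as a receipt."""
        cur = self.db.execute(
            "INSERT INTO inbox (payload) VALUES (?)", (json.dumps(event),)
        )
        self.db.commit()
        return cur.lastrowid

    def pending(self) -> list[tuple[int, dict]]:
        """Return events that have not been marked done, oldest first."""
        rows = self.db.execute(
            "SELECT id, payload FROM inbox WHERE done = 0 ORDER BY id"
        ).fetchall()
        return [(row_id, json.loads(payload)) for row_id, payload in rows]

    def mark_done(self, row_id: int) -> None:
        self.db.execute("UPDATE inbox SET done = 1 WHERE id = ?", (row_id,))
        self.db.commit()
```

In this arrangement the webhook handler would only call `accept()`, and a separate worker would loop over `pending()`, hand each event to `dispatch()`, and call `mark_done()` once the receipt comes back.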

Pydantic AI Is Now the Default Harness

PyFlue 0.2.0 makes Pydantic AI the default harness backend. This means a default PyFlue install now uses a typed, model-agnostic Python agent loop without requiring LangChain or LangGraph. DeepAgents remains supported, but it is now optional:

pip install "pyflue[deepagents]"

Then select it explicitly:

agent = await init(harness="deepagents")

This change keeps the default dependency footprint smaller and better aligned with Python users who want Pydantic-first agent development.

Better Tooling and Customization

This release adds several improvements for defining agent behavior and capabilities.

Agent Profiles

Reusable profiles let teams share model-facing behavior across agents, workflows, and subagents.

from pyflue import define_agent_profile

reviewer = define_agent_profile(
    {
        "name": "reviewer",
        "model": "anthropic/claude-sonnet-4-6",
        "instructions": "Review the change and report evidence-backed findings.",
    }
)

Profiles can define models, instructions, tools, skills, subagents, reasoning effort, and compaction behavior.

Subagent Profiles

Agents can now delegate focused work to declared subagent profiles. This supports patterns such as review agents, research agents, data analysis agents, and scoped coding assistants.

Packaged Skills

load_skill(path) imports an Agent Skills compatible SKILL.md file as a reusable skill object. This improves the path for packaging skills with applications rather than relying only on workspace-discovered skills.

Local Sandbox

The new host local() sandbox gives agents controlled access to the host filesystem and subprocess shell. This is useful in CI runners, local automation, and trusted environments where the host itself is the isolation boundary. Remote sandbox adapters remain available for Daytona, E2B, Modal, and Runloop.

WebSocket Support

PyFlue now supports WebSocket surfaces for both agents and workflows. Persistent agents can use WebSockets for multi-prompt sessions over one connection. Workflows can use WebSockets for one invocation that streams events and returns the final result.

async with client.agents.connect("assistant", "thread-123") as conn:
    first = await conn.prompt("Summarize the context.")
    second = await conn.prompt("Draft a reply.")

async with client.workflows.connect("summarize") as conn:
    messages = await conn.run({"text": "..."})

Observability and OpenTelemetry

PyFlue 0.2.0 adds a more complete event model and an OpenTelemetry adapter. Agent operations, workflow runs, model generations, tools, tasks, and compaction events can now be correlated through structured events. The OpenTelemetry adapter maps these events to spans, including model and token usage metadata.

from pyflue import create_opentelemetry_observer, init

agent = await init(
    on_event=create_opentelemetry_observer(),
)

This makes it easier to inspect what agents are doing in production, including which operations ran, which tools were called, and how model usage accumulated.

Client Improvements

The Python client now mirrors more of the Flue SDK structure. It includes namespaces for agents, workflows, runs, and admin APIs. Persistent direct agent calls return result envelopes. Workflow invocations return run ids that can be inspected through run APIs. This distinction is important. Runs belong to workflows. Direct persistent agent prompts continue sessions and do not create workflow runs.

Improved Project Layout

New projects use the src/ layout by default:

AGENTS.md
pyflue.toml
.agents/
  roles/
  skills/
src/
  agents/
    assistant.py
  workflows/
    summarize.py

PyFlue still supports legacy file-based handlers, root-level agent files, and .pyflue or .agents locations, but src/agents and src/workflows are now the canonical layout.

Documentation and Parity Tracking

The release includes expanded documentation for agents, workflows, tools, subagents, models and providers, chat integrations, production guidance, observability, client usage, and Flue parity. A new parity reference page documents what PyFlue implements relative to the TypeScript Flue framework, what is intentionally not ported, and where Python-specific design choices differ. The main intentionally unported area is Cloudflare-specific runtime behavior, such as Durable Object backed execution recovery and Cloudflare-native runtime integration. PyFlue provides Python-native alternatives where practical, while keeping Cloudflare-specific internals out of the core Python runtime.

Breaking Changes

There are two important breaking changes in PyFlue 0.2.0.

Runs are now workflow-only. Direct and dispatched persistent agent prompts no longer create workflow runs or surface run ids. They correlate by agent instance, session, and operation.
DeepAgents is no longer the default harness and is no longer installed as a core dependency. Projects using DeepAgents should install the optional extra and select the backend explicitly.

pip install "pyflue[deepagents]"

agent = await init(harness="deepagents")

Installing PyFlue 0.2.0

Install with pip:

pip install pyflue

Or with uv:

uv add pyflue

Optional extras include:

  pip install "pyflue[deepagents]"
  pip install "pyflue[otel]"
  pip install "pyflue[sandboxes]"
  pip install "pyflue[monty]"

Or use the equivalent uv commands to add them to your project.

Looking Ahead

PyFlue 0.2.0 establishes the framework shape for Python agents: persistent agents, finite workflows, typed outputs, skills, tools, sandboxes, observability, and deployment paths. The next areas to watch are production durability for dispatched input, deeper provider and model routing, and continued parity review against the TypeScript Flue framework. For now, this release gives Python teams a stronger foundation for building agent systems that are structured, observable, and easier to deploy.

Learn more about PyFlue on GitHub, read the PyFlue documentation, or visit the Flue framework website.

Dynamic Workflows Vs Recursive Language Models (RLMs): Through Academic and Industry Lenses

Shashi Jagtap — Fri, 29 May 2026 20:19:08 GMT

Anthropic recently launched Opus 4.8, its smartest model, and developers are loving its coding capabilities. One of the features launched on top of it is Dynamic Workflows. The feature has sparked a debate on social media and across the Internet over whether Anthropic copied ideas from the academic RLM paper and implemented them in its product. In this post, we try to work out what is true and what is not, from our angle, through a detailed examination of the connections between Anthropic’s recent capabilities and foundational academic research in agentic artificial intelligence.

The artificial intelligence community continues to analyze Anthropic’s introduction of Dynamic Workflows in Claude Code, released as a research preview in May 2026. Developers have reported substantial successes using this feature on large-scale codebases through parallel subagent coordination and comprehensive end-to-end task execution. A central question in ongoing discussions concerns the relationship between Dynamic Workflows and Recursive Language Models, or RLMs, first proposed in academic research from October 2025. Professionals are evaluating whether this capability represents a practical realization of RLM principles in a production cloud environment.

This post provides a structured overview of the core concepts, their interconnections, perspectives from researchers and practitioners, early open-source contributions, and broader industry implications. It draws on recent statements to present a balanced and current assessment.

Understanding Recursive Language Models (RLMs)

Recursive Language Models were introduced in the October 2025 MIT CSAIL research by Alex Zhang and collaborators. This framework offers an inference strategy to overcome challenges in managing complex, long-horizon tasks with large language models. Conventional methods frequently attempt to incorporate all required information into a single extended prompt. Such approaches often lead to context degradation, in which model performance deteriorates as input length grows.

RLMs address this by reframing the interaction as an external REPL-style environment. The model decomposes problems programmatically, typically by generating code that enables recursive self-calls or tool invocations. Context is handled externally rather than being limited to the model’s immediate context window. This design supports scalable task decomposition while preserving efficiency and coherence. In principle, it allows for near-infinite effective context lengths by offloading information to external structures and permitting dynamic interaction with them.
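The recursive idea can be shown with a toy example. To be clear about what is swapped in: in an RLM the model itself writes code in a REPL to decide how to decompose the context, whereas the sketch below hard-codes a fixed binary split. The `recursive_answer` function and the `LLM` callable are hypothetical stand-ins, and no real model is called.

```python
from typing import Callable

LLM = Callable[[str], str]  # stand-in for a real model call


def recursive_answer(query: str, context: str, llm: LLM,
                     max_chars: int = 2000) -> str:
    """Answer `query` over `context`, splitting the context when it is too long."""
    if max_chars < 1:
        raise ValueError("max_chars must be at least 1")
    if len(context) <= max_chars:
        # Base case: the context fits in a single call.
        return llm(f"Question: {query}\nContext: {context}")
    mid = len(context) // 2
    left = recursive_answer(query, context[:mid], llm, max_chars)
    right = recursive_answer(query, context[mid:], llm, max_chars)
    # Combine the two partial answers with one more call over a short prompt.
    return llm(f"Question: {query}\nPartial answers:\n{left}\n{right}")
```

Each call only ever sees a bounded slice of the input, which is the property that lets the overall context grow well beyond a single window. A naive character split like this can cut relevant text in half, which is one reason a model-directed decomposition is more attractive.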

Alex Zhang’s subsequent work on the Mismanaged Geniuses Hypothesis expands on these ideas. The hypothesis posits that frontier models hold considerable latent potential for specialized tasks but are often underutilized due to inadequate orchestration. Improved decomposition strategies, including those enabled by RLMs or orchestrator-subagent systems, present a promising direction. These approaches focus on training models for effective self-management, which may yield advances in length generalization, long-horizon reasoning, and handling out-of-distribution problems through structured composition rather than solely through increases in model scale.

Dynamic Workflows from Anthropic

Dynamic Workflows is Anthropic’s latest step in agentic capabilities within Claude Code. When a prompt includes terms such as “workflow,” the model generates an orchestration script dynamically. This script then coordinates a fleet of parallel subagents, executes tasks within the Claude Code environment, verifies outputs, and iterates as necessary to fulfill the overall goal. I tried Dynamic Workflows myself and was blown away: given a single goal, it had 80 agents working independently toward it. This is incredible for developers, because the task is automatically handed over to multiple agents working at the same time. Dynamic Workflows is an incredible feature from the Claude Code team.

The feature has shown practical utility in demanding scenarios, including large-scale codebase migrations and intricate refactors. It delivers a seamless, cloud-native experience for managing extensive repositories while incorporating built-in verification. The system prioritizes reliability and scalability, enabling developers to address ambitious projects with less manual oversight. At the same time, observations indicate that it can be token-intensive when applied at scale.

Perspectives on the Relationship: Alignments and Distinctions

Researchers and practitioners have offered detailed analyses of how Dynamic Workflows relates to RLM concepts. Alex Zhang, lead author of the original RLM paper, has stated that Opus 4.8 combined with Dynamic Workflows in Claude Code represents perhaps the first instance of a frontier model being seriously trained toward RLM principles. He has noted that recent developments move the field closer to the RLM vision and has suggested that such capabilities could become the standard for nearly all coding agent interactions within a year. Zhang recommends consulting the RLM paper to appreciate the value of the underlying abstractions.

Omar Khattab, an MIT CSAIL researcher closely associated with RLM and related work, has provided specific criteria from the paper that align with the new feature. According to Khattab, an RLM requires two core elements: first, giving the underlying LLM a symbolic handle to the user prompt and the output stream; second, symbolic recursion over the prompt, which corresponds directly to what Anthropic refers to as “dynamic workflows.” Omar has remarked that Claude Code has effectively become an RLM with this release.

Many in the community characterize Dynamic Workflows as operationalizing RLM ideas in a production setting. Common descriptions include “RLM on agent harnesses” and recognition of a new scaling dimension that combines base model compute, inference-time thinking compute, and generated harness or orchestration compute. This perspective views the release as a meaningful advancement in agentic system design.

At the same time, nuanced distinctions have been highlighted. Some researchers argue that Dynamic Workflows does not fully satisfy a strict definition of RLM. For example, it may lack a fully persistent language REPL with programmatic context access beyond standard tool use. Instead, it generates an orchestration script followed by tool calls and text-based handoffs between agents, rather than direct recursive self-calls with clear return semantics in a REPL sandbox. One analysis scores the match at roughly one-third based on core criteria such as programmatic context access and persistent REPL semantics.

Additional concerns focus on recursion control. When the model itself determines when to stop recursion without external ceilings on cost or time, it may introduce new versions of longstanding challenges rather than resolving them. The original RLM paper described recursion at a depth of one, whereas Dynamic Workflows extends further, including greater control over model weights. These viewpoints position Dynamic Workflows as a valuable evolution of sub-agent orchestration rather than an exact replica, while still recognizing its conceptual overlap and practical strengths.
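
The recursion-control concern can be illustrated with a small sketch of an external ceiling. Everything here is hypothetical (the class, the limits, and the toy `spawn` function are not part of Claude Code or the RLM paper); the point is only that the harness, not the model, decides when recursion must stop.

```python
class BudgetExceeded(RuntimeError):
    """Raised when the external ceiling is hit."""


class RecursionBudget:
    """External ceiling on recursion depth and total sub-calls (illustrative)."""

    def __init__(self, max_depth: int, max_calls: int) -> None:
        self.max_depth = max_depth
        self.max_calls = max_calls
        self.calls = 0

    def charge(self, depth: int) -> None:
        # Every spawn is counted before it is allowed to proceed.
        self.calls += 1
        if depth > self.max_depth:
            raise BudgetExceeded(f"depth {depth} > {self.max_depth}")
        if self.calls > self.max_calls:
            raise BudgetExceeded(f"calls {self.calls} > {self.max_calls}")


def spawn(task: str, budget: RecursionBudget, depth: int = 0) -> int:
    """Toy orchestrator that always asks for two more subtasks."""
    budget.charge(depth)
    # A real agent would decide here whether to decompose further; this toy
    # always does, so only the budget can stop it.
    return 1 + sum(spawn(f"{task}.{i}", budget, depth + 1) for i in range(2))


budget = RecursionBudget(max_depth=3, max_calls=10)
try:
    spawn("root", budget)
    stopped, reason = False, ""
except BudgetExceeded as exc:
    stopped, reason = True, str(exc)
print(stopped, reason, budget.calls)
```

A production harness would more likely degrade gracefully (for example, by treating the task as a leaf) than raise, but the exception keeps the control flow easy to follow.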

Timeline, Early Implementations, and Agentnetes by Superagentic AI

The progression from theory to implementation has been notably rapid. The October 2025 RLM paper was followed in December 2025 by open-source projects emerging from the Superagentic AI Vercel x DeepMind “Zero to Agent” hackathon in London. One such example is Agentnetes from Superagentic AI, a framework for self-organizing agent swarms.

Drawing from its documentation, Agentnetes applies RLM-inspired recursive decomposition by transforming a single high-level goal into an emergent team of specialist agents in isolated sandboxes. It features a root “Tech Lead” agent that explores via external tools, along with tight loops of search, analysis, planning, execution, and verification. Agents maintain minimal token footprints using basic tools and support parallelism, inter-agent collaboration, and self-healing mechanisms. This and similar early efforts demonstrated recursive, dynamic multi-agent approaches in accessible developer tools well before broader industry rollout.

Industry Patterns and the Academic-to-Product Gap

These developments illustrate a recurring pattern in artificial intelligence. Major laboratories often incorporate concepts from institutions such as MIT, Berkeley, and Stanford with a lag of approximately six to nine months. Ideas involving advanced decomposition, external context management, and dynamic multi-agent orchestration frequently originate in research papers and collaborative events before being refined into commercial products, sometimes without direct attribution to original sources.

Anthropic’s Dynamic Workflows exemplifies the effective translation of academic and early open-source insights into a scalable cloud solution. While this delivers immediate benefits to users, it also highlights the importance of the wider ecosystem of researchers and independent developers who identify and prototype foundational concepts.

Discussions around the Mismanaged Geniuses Hypothesis reinforce that meaningful progress may depend on enabling models to perform effective self-management and decomposition within well-designed scaffolds.

Is Dynamic Workflows Really RLM in the Cloud?

This is the time to add my own hot take on this debate. Anthropic implemented an RLM-like approach in an earlier release, when they launched managed agents, and since then they have indicated that they are using concepts from RLM in their workflows. From that point it was clear that they were following the research and investing heavily in academic work. After that, they may have picked up ideas from other RLMs and from sandboxed environments, and those would have come into the picture for Dynamic Workflows.

So the ideas are similar but not exactly the same. Looking at the RLM paper, it stops at a recursion depth of 1. Anthropic went well beyond that with a recursive agent that spawns multiple subagents, which extends the RLM concept. Having said that, the idea as implemented is not exactly an RLM. Whatever it is, Anthropic has innovated a lot on this feature, and it has now become the industry standard.

On the other side, other coding agents such as Codex have also started investigating these ideas. In recent Codex versions we have noticed it writing Python code rather than running shell commands, which is an idea from RLMs, but they have not innovated to the same level as Anthropic. The Codex team could have innovated far more here: they share the same Python tech stack, and their harness is well optimized for Python, so they could have implemented this much earlier. Instead it is Claude Code, a TypeScript-first framework, that has ported the RLM ideas into its product. I hope the Codex team will catch up on this sooner rather than later.

In my opinion, the concepts of RLM and Dynamic Workflows are close and very similar but differ in implementation. In the end, execution matters.

Conclusion

The relationship between Dynamic Workflows and Recursive Language Models continues to generate constructive dialogue, with direct contributions from key researchers such as Alex Zhang and Omar Khattab. While alignments are evident and substantial, distinctions in implementation details remain subjects of thoughtful analysis. Regardless of final classification, the release clearly validates recursive and dynamic multi-agent methodologies as essential to the advancement of agentic artificial intelligence.

From academic foundations in RLM research, through pioneering open-source projects such as Agentnetes, to refined cloud implementations, the field is advancing with notable speed. This era demonstrates how academic innovation can translate rapidly into accessible and powerful tools. The central insight is that future breakthroughs will rely as much on sophisticated orchestration as on raw model intelligence.

Practitioners are encouraged to explore both the original research and current implementations, contributing to the continued refinement of agentic systems. Thoughtful perspectives on these developments are welcome.

CodexOpt Brings Microsoft SkillOpt to Codex: Optimizing Agent Skills with Execution Feedback

Shashi Jagtap — Tue, 26 May 2026 06:36:27 GMT

Microsoft Research released the SkillOpt paper. The work has generated considerable discussion across the AI community. Researchers and engineers highlight its disciplined approach to improving agent capabilities without modifying model weights.

Codex users already depend on AGENTS.md and SKILL.md files as active components of runtime behavior in OpenAI’s Codex harness. SkillOpt offers a structured method to optimize these artifacts through execution feedback, bounded edits, and validation gating. This approach transforms intuitive prompt adjustments into measurable, reproducible gains.

With CodexOpt 0.2.0, we have integrated these concepts into a practical CLI workflow designed specifically for Codex users.

Community Reactions on X

Discussions on social media reflect strong interest in SkillOpt. Many posts emphasize its shift from hand-crafted prompts to systematic, evidence-based skill evolution. Key points frequently mentioned include:

SkillOpt treats skill documents as trainable external state for frozen models, applying optimizer-like discipline (bounded add/delete/replace edits, textual learning rates, and strict validation gates).
Strong empirical results: best or tied for best across all 52 evaluated settings (models × benchmarks × harnesses). Notable gains on GPT-5.5 include +23.5 points in direct chat, +24.8 points in Codex harness, and +19.1 points in Claude Code.
Practical advantages: zero additional inference cost at runtime, strong transferability across models and harnesses, and skills that remain compact and human-readable.
Comparisons frequently note that SkillOpt outperforms baselines such as human-written skills, TextGrad, GEPA, and EvoSkill.

Some voices raise longer-term questions about reliance on static optimized skills as dynamic reasoning improves, but the prevailing sentiment views this as a valuable step toward more reliable agent engineering today.

What SkillOpt Delivers

SkillOpt frames natural-language skill documents as optimizable external state. An optimizer analyzes rollout trajectories, proposes controlled edits, and accepts changes only when they improve performance on held-out validation tasks.

Key strengths:

Rigorous validation that prevents unproven or bloating changes.
Exported skills are plain text files with no runtime overhead.
Demonstrated improvements on benchmarks such as Spreadsheet solving (41.8 percent to 80.7 percent) and Office QA.

Why Codex Is an Ideal Target

OpenAI’s Codex harness incorporates instruction files directly into the agent loop. This produces observable trajectories that provide rich feedback for optimization.

CodexOpt treats Codex runs as rollouts:

Deploy a candidate skill.
Execute tasks through codex exec.
Capture JSON event streams and outcomes.
Score behavior using verifiers, LLM judges, or static analysis.
Generate bounded rewrites.
Validate on held-out tasks before acceptance.
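
The loop above can be sketched in a few lines of Python. This is an illustrative sketch, not CodexOpt's internals: `toy_score` is a hypothetical scorer that stands in for running `codex exec` and scoring the trajectory, and the candidate edits are supplied by hand rather than generated. It shows two mechanics only: rejecting edits that exceed a line budget, and accepting a candidate only when it improves the held-out score.

```python
import difflib


def edit_size(old: str, new: str) -> int:
    """Number of added or removed lines between two skill texts (a crude edit budget)."""
    diff = difflib.ndiff(old.splitlines(), new.splitlines())
    return sum(1 for d in diff if d.startswith(("+ ", "- ")))


def toy_score(skill: str, tasks: list[str]) -> float:
    """Hypothetical scorer: fraction of tasks whose keyword the skill mentions.
    A real setup would run the agent on each task and score the trajectory."""
    return sum(1 for t in tasks if t in skill) / len(tasks)


def gated_update(skill, candidates, val_tasks, max_edit_lines=2):
    """Accept a candidate only if it is a bounded edit AND improves the held-out score."""
    best, best_score = skill, toy_score(skill, val_tasks)
    for cand in candidates:
        if edit_size(best, cand) > max_edit_lines:
            continue  # reject: the edit exceeds the budget
        score = toy_score(cand, val_tasks)
        if score > best_score:  # validation gate
            best, best_score = cand, score
    return best, best_score


skill = "Use pytest.\nPrefer small diffs."
candidates = [
    "Use pytest.\nPrefer small diffs.\nRun ruff before committing.",  # one added line
    "Rewrite everything.\nIgnore tests.\nShip fast.\nNo review.",  # far too large
]
val_tasks = ["pytest", "ruff"]
final, score = gated_update(skill, candidates, val_tasks)
print(score)
print(final)
```

The strict `>` comparison means a candidate that merely ties the current score is rejected, which is one simple way to keep a skill file from growing without evidence.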

What’s New in CodexOpt 0.2.0

uv run codexopt improve # Safe preview mode
uv run codexopt improve --live # Full optimization with Codex
uv run codexopt improve --live --apply # Apply validated changes

Core capabilities:

Automatic train/validation task splits mined from git history, issues, and skill descriptions.
Bounded edits with textual controls.
Validation-gated acceptance.
Multiple reward signals including verifier outcomes and LLM judge feedback.
Full Codex JSONL trajectory support.
Detailed reports showing accepted diffs and performance changes.

SkillOpt Mapping to CodexOpt

Skill artifact: SKILL.md or AGENTS.md
Rollout: codex exec or command verifier
Feedback: Trajectory analysis + multi-signal scoring
Bounded edit: Edit budget + controlled modifications
Validation gate: Held-out task performance
Exported skill: Validated file diff with backups

GEPA Influence and CodexOpt Approach

The SkillOpt paper evaluates against GEPA and similar methods. CodexOpt incorporates useful elements of textual reflection while delivering a streamlined, Codex-native implementation. The previous GEPA engine path has been deprecated in favor of the maintained reflective engine.

Practical Workflow

Preview changes safely with uv run codexopt improve.
Run live optimization with uv run codexopt improve --live.
Review results using uv run codexopt report.
Apply validated edits with --apply.

Enhanced task evidence in tasks.md or JSON format strengthens optimization signals, from simple descriptions to full Codex rollout tests.

What Is the Community Saying?

Community conversations on X highlight a shared challenge: manual skill editing often results in inconsistency or prompt bloat. SkillOpt and tools like CodexOpt establish a higher standard where skills must demonstrate value through measurable task improvements.

Optimized skills become reliable, transferable artifacts that reduce agent errors and improve workflow consistency.

Install and Get Started

pip install codexopt==0.2.0

PyPI release: https://pypi.org/project/codexopt/0.2.0/
SkillOpt Paper: https://arxiv.org/abs/2605.23904
SkillOpt Project Page: https://microsoft.github.io/SkillOpt/

Closing

Agent skills have evolved from static notes into operational, optimizable components. CodexOpt 0.2.0 makes SkillOpt-style optimization practical for Codex users by combining rigorous validation with direct harness integration.

Evidence-driven improvement provides a clearer path forward than intuition alone. Start optimizing your skills today.

Grok Build Enters the Agentic Coding Arena with Official Grok CLI: Game ON

Shashi Jagtap — Sat, 16 May 2026 08:09:08 GMT

xAI has officially entered the agentic coding arena. On May 14, 2026, the company launched an early beta of Grok Build, its native agentic command-line interface (CLI). The launch blog post highlights features designed for coding, building applications, and automating complex workflows directly in the terminal. Grok CLI represents a significant advancement for the Grok ecosystem.

What Is New in Grok Build

As per the launch page, Grok Build delivers a fast, flicker-free terminal experience built specifically for multi-agent coordination. Key new capabilities include:

Skills and customization: Grok Build adapts to individual workflows and preferences through customizable skills.
Plan viewer: A dedicated interface makes it straightforward to architect and review complex projects.
Plugins and marketplaces: Users can extend functionality with plugins, hooks, and shared capabilities across teams via marketplaces.
Intelligent Q&A: The system asks targeted questions to clarify requirements and refine outcomes.
Parallel subagents: Multiple subagents operate simultaneously for research, building, and code review, significantly accelerating development.

These features create a cohesive, terminal-native environment that supports plans, subagents, and parallel execution. Installation is simple for eligible users, and the tool emphasizes seamless integration into daily developer routines. For full details and to try it, visit the official Grok Build page.

Grok Build positions itself as a direct competitor to tools such as Codex and Claude. While traditional assistants focus on code completion within editors or browsers, Grok Build offers end-to-end agentic workflows entirely within the command line.

Understanding the Grok Tooling Landscape

Grok Code models are the earlier-released models with agentic coding capabilities that many developers used while they were freely available.
Community Grok CLI projects leverage the public xAI API for accessible alternatives.
Grok Build is the official, polished xAI implementation now available exclusively to SuperGrok Heavy subscribers.

Public discussions on social media show strong enthusiasm for the product’s technical capabilities and terminal-first design, tempered by questions about accessibility during the beta phase.

My Experience with Grok Code

I used Grok Code models extensively with harnesses like Cursor and OpenCode when they were free and essentially unlimited. The experience proved transformative, and at Superagentic AI I built many tools with them between October 2025 and January 2026. Grok Code moved beyond simple suggestions to deliver genuine agency. It analyzed projects, developed structured plans, created and modified files, executed tests, and iterated based on feedback. I completed prototypes in hours that traditionally required days of manual effort.

This power came with important responsibilities. Grok Code sometimes executed commands or made file changes with minimal prompting. I occasionally needed to step in to prevent unintended modifications. These experiences reinforced a critical principle: powerful agentic models require disciplined oversight. Always review proposed actions, use isolated environments for testing, and avoid granting unrestricted access to production systems or sensitive information. This security-first approach remains essential with Grok Build.

Comparison with Codex and Claude

Initially, I relied on both Claude Code and OpenAI’s Codex for the majority of my development work. In recent months, Codex has become my default choice for Python-based projects. Its deep optimization for Python, further strengthened by OpenAI’s acquisition of Astral and integration of high-performance tools such as uv, Ruff, and ty, delivers exceptional efficiency in Python environments. This combination has made Codex the most reliable option for my typical Python workflows.

Grok Build now prompts serious consideration of a switch. Given my consistently positive past experiences with Grok models and their strong agentic capabilities, I am optimistic that the official CLI will offer comparable or superior optimization for Python projects. The terminal-native multi-agent design appears particularly promising for complex, iterative Python development. If Grok Build demonstrates the expected level of Python performance, I plan to adopt it as my primary tool once I upgrade to the required plan. The combination of raw agentic power, terminal integration, and responsible security practices could make it a compelling long-term choice.

Is SuperGrok Heavy Worth the Investment?

SuperGrok Heavy, required for Grok Build access, carries a price point of approximately $300 per month after an introductory period. For power users who ship production code daily and prioritize terminal-native multi-agent orchestration, the productivity gains may justify the cost. Reduced context switching and accelerated workflows offer clear value in high-volume development environments. The introductory pricing provides a lower-risk opportunity to evaluate the tool during beta. For most individual developers and smaller teams, the full subscription may not be immediately necessary. Open-source Grok CLI tools connected to the standard xAI API deliver substantial agentic functionality at pay-per-use rates. While these alternatives may lack some official refinements, they support effective daily work. My prior success with Grok Code confirms the models’ strength in agentic tasks when applied responsibly. I intend to test the introductory offer myself.

Looking Ahead

Grok Build intensifies competition among leading agentic coding solutions. xAI has shown that its models support sophisticated, production-oriented agentic work. Future progress will hinge on rapid iteration based on user feedback, broader access options, and continued safety improvements. As Elon Musk recently noted about the current beta, “Go in with expectations that Grok Build is still beta, but improving almost every day.” Developers who work extensively in the terminal should explore Grok Build. Visit x.ai/cli for installation instructions and details on the introductory offer. Those not prepared to subscribe can begin with API-based open-source alternatives.

The age of truly agentic coding tools has arrived, and Grok Build is now a capable contender in this evolving space. Grok CLI has officially entered the agentic coding race, and that is a warning sign for Claude Code, Codex, and others building coding agents. GAME ON!

Learning, Fast and Slow: What’s Next in LLM Fine-Tuning and Plastic Continual Learning with GEPA

Shashi Jagtap — Wed, 13 May 2026 07:28:44 GMT

Fine-tuning has been taking a new shape in recent days: OpenAI decided to wind down its fine-tuning service, and a few others have declared the end of the fine-tuning era. Large language models are powerful tools. Yet adapting them to new tasks remains difficult. Most current methods update the model’s internal parameters to improve performance on a specific task. While effective at first, this approach often causes the model to forget previous skills, a problem known as catastrophic forgetting. It can also reduce the model’s ability to learn new tasks in the future.

A new paper titled “Learning, Fast and Slow: Towards LLMs That Adapt Continually” presents a promising solution. Early reactions on social media have noted that the paper “splits learning into slow weights and fast, feedback-driven context, yielding big gains in sample efficiency and much less forgetting.” The paper is written by researchers including Lakshya A. Agrawal, one of the key authors behind the earlier GEPA method. The full paper is available at https://arxiv.org/pdf/2605.12484v1.

The Idea: Fine-Tuning Should Use Two Brains

The authors propose using two complementary ways to adapt large language models at the same time. The first is slow learning through updates to the model parameters. These changes are powerful and lasting but expensive and can reduce the model’s flexibility. The second is fast learning through optimized prompts and textual context. These changes are quick, low cost, and fully reversible.

Instead of choosing only one approach, the method lets both work together. Fast prompts capture task specific details while the slow parameter updates focus on deeper, more general improvements. This idea is inspired by earlier concepts in neural networks that separate temporary adaptations from long term knowledge.

How the Method Works

The approach combines reinforcement learning for parameter updates with GEPA, a reflective evolutionary prompt optimization technique. It runs in repeating cycles. In each cycle, GEPA first evolves a small set of effective prompts using the current model. Then the method performs several steps of reinforcement learning while using these improved prompts as additional context. This loop allows the fast prompts to absorb recent task specific lessons quickly. As a result, the model parameters do not need to store every detail. The core model stays closer to its original capabilities and retains more flexibility for future learning.
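
The alternating cycle can be sketched as a small control loop. This is a skeleton under stated assumptions, not the paper's code: `evolve_prompts` and `rl_step` are hypothetical callables standing in for GEPA and for a reinforcement learning update, and in the toy run an integer stands in for the model parameters.

```python
from typing import Callable


def fast_slow_training(
    params: int,
    prompts: list[str],
    evolve_prompts: Callable[[int, list[str]], list[str]],
    rl_step: Callable[[int, list[str]], int],
    cycles: int = 3,
    rl_steps_per_cycle: int = 2,
) -> tuple[int, list[str]]:
    """Alternate a fast prompt-evolution phase with a slow parameter-update phase."""
    for _ in range(cycles):
        # Fast channel: evolve the prompt set using the current parameters.
        prompts = evolve_prompts(params, prompts)
        # Slow channel: a few parameter updates with the evolved prompts as context.
        for _ in range(rl_steps_per_cycle):
            params = rl_step(params, prompts)
    return params, prompts


# Toy stand-ins (hypothetical) that only record the order of operations.
log: list[str] = []


def toy_evolve(params: int, prompts: list[str]) -> list[str]:
    log.append("evolve")
    return [f"prompt@{params}"] + prompts[:1]


def toy_rl_step(params: int, prompts: list[str]) -> int:
    log.append(f"rl:{prompts[0]}")
    return params + 1


params, prompts = fast_slow_training(
    0, ["seed prompt"], toy_evolve, toy_rl_step, cycles=2, rl_steps_per_cycle=2
)
print(log)
```

The log shows that each batch of slow updates sees prompts evolved against the parameters from the previous cycle, which is the interleaving the paragraph describes.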

Key Benefits

The paper highlights several important advantages of this dual channel method compared to standard reinforcement learning approaches alone.

Improves data efficiency, reaching strong performance with fewer training steps.
Achieves a higher final performance level across the tested tasks.
Updated model stays closer to the original base model, which helps preserve existing knowledge.
Maintains better plasticity, meaning the model can still learn new tasks effectively after training on a previous one.
Supports true continual learning, allowing the model to keep improving when task domains change over time.

These benefits are especially relevant for building AI systems that need to evolve with changing requirements.

Limitations

The experiments use 8B scale models and focus on specific reasoning tasks including code output prediction, math reasoning, and multi hop fact verification. Results may not extend directly to much larger models or other domains. The prompt evolution step adds some extra computation during training. The authors present the method as a useful extension to existing reinforcement learning pipelines rather than a replacement for all post training approaches. Further research is needed to explore its scaling and broader applicability.

Timely Context: Recent Developments in Fine-Tuning

Recent discussions reflect a changing landscape for LLM fine-tuning. OpenAI is winding down its self-serve fine-tuning platform, with new training jobs restricted and the service ending for new use by January 2027. Many in the community see this as a move toward greater reliance on prompt engineering, context management, and RAG. At the same time, there is increased interest in open-weight models and custom post-training, especially for domain-specific agents and long-horizon tasks.

The dual-channel approach in this paper offers a practical path for efficient and continual adaptation in this new environment.

Next Frontier in Fine Tuning

For AI and machine learning practitioners and software engineers working with large language models, this paper encourages a shift in how we think about fine tuning. By letting optimized prompts share the adaptation load, it becomes possible to train more efficiently while keeping the model adaptable over time. This dual channel perspective could lead to lower training costs, more reliable systems, and large language models that continue to improve without the common drawbacks of forgetting or rigidity.

The complete paper, including all methods and supporting details, is available here: https://arxiv.org/pdf/2605.12484v1.

This research points toward a practical path for developing large language models that are not only strong but also genuinely adaptable throughout their lifetime. I look forward to using this in some use cases where applicable, but for now, enjoy the paper from these great authors.

The Rise of Agent Harness Frameworks and Harness As a Service (HaaS)

Shashi Jagtap — Sun, 10 May 2026 11:10:47 GMT

The rapid maturation of agentic AI in 2026 has recently shifted attention from the model to the scaffold around it. While model capabilities continue to advance, the decisive factor in building reliable, production-ready agents lies in the systems surrounding them. A capable model paired with a well-designed harness consistently outperforms a stronger model operating with inadequate scaffolding.

This insight has crystallized into a distinct engineering discipline: harness engineering. Recent contributions from industry leaders, including Addy Osmani’s comprehensive overview, and foundational work by Vivek Trivedy at LangChain, have clarified the principles and practices driving this evolution. At Superagentic AI, our ongoing research and open-source efforts in this area, including the development of PyFlue, inspired by Flue, reinforce a core conviction: the harness is no longer an afterthought. It has become the primary source of differentiation and reliability for autonomous agents.

Understanding the Agent Harness

An agent is not simply a large language model. It is a model integrated with a comprehensive runtime environment that enables purposeful action. As articulated across multiple independent blogs from industry leaders, the equation is: Agent = Model + Harness. The harness encompasses all components beyond the model itself: system prompts and rule files such as AGENTS.md, tools and skills with their descriptions, orchestration logic for subagents, hooks and middleware for enforcement, sandboxes and execution environments, observability and tracing systems, memory and context management mechanisms, and recovery pathways.

This scaffolding transforms raw generative capabilities into structured, verifiable workflows. A raw model generates text. A harness equips it with state persistence, tool execution, self-correction loops, and safety boundaries so it can complete complex, multi-step tasks reliably.

The harness operates on a ratchet principle. Each observed failure leads to a permanent improvement: a new rule in the system prompt, a blocking hook before destructive commands, refined context management to combat degradation, or a split between planning and execution subagents. These adjustments accumulate, making the agent progressively more aligned with the specific demands of its environment and use case.
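
As a minimal sketch of the ratchet in practice, the code below shows a deterministic pre-execution hook with a deny-list that grows after an observed failure. The patterns and function names are illustrative assumptions, not taken from any particular harness, and a real policy would need to be far more thorough.

```python
import re

# Illustrative deny-list; a real harness would use a fuller policy.
DESTRUCTIVE_PATTERNS = [
    re.compile(r"^rm\s+(-\w*[rf]\w*\s+)+/"),  # rm with -r/-f flags on an absolute path
    re.compile(r"^git\s+push\s+.*--force"),  # force pushes
]


def pre_exec_hook(command: str) -> tuple[bool, str]:
    """Return (allowed, reason). Intended to run before every shell tool call."""
    normalized = " ".join(command.split())  # collapse repeated whitespace
    for pattern in DESTRUCTIVE_PATTERNS:
        if pattern.search(normalized):
            return False, f"blocked by rule: {pattern.pattern}"
    return True, "ok"


print(pre_exec_hook("rm   -rf   /var/data"))
print(pre_exec_hook("git push origin main"))

# Ratchet: an observed failure becomes a permanent rule.
before = pre_exec_hook("chmod -R 777 /srv")[0]
DESTRUCTIVE_PATTERNS.append(re.compile(r"^chmod\s+-R\s+777\b"))
after = pre_exec_hook("chmod -R 777 /srv")[0]
print(before, after)
```

Because the check is plain code rather than a prompt instruction, it behaves the same way on every run, regardless of what the model decides to attempt.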

Failures previously attributed to model limitations often prove to be configuration opportunities. Benchmarks frequently demonstrate that the same model achieves markedly different results depending on the quality of its harness. The performance gap is not merely incremental; it frequently determines whether an agent delivers production value or remains a research prototype.

Towards Harness Engineering

Early agent development emphasized prompt design. Practitioners quickly discovered its limitations for sustained, real-world performance. Context windows fill and degrade, tools require careful integration, long-horizon tasks demand decomposition and verification, and safety concerns multiply in open environments.

Before harness engineering was even talked about publicly, Superagentic AI released a paper, SuperOpt (Agentic Environment Optimization for AI agents), built on the idea of treating the entire agent as the optimization target rather than optimizing each agent component individually. This idea is now turning mainstream as harness engineering.

Harness engineering addresses these challenges systematically. It treats the runtime as a first-class, evolvable artifact rather than ad-hoc scripts. Key primitives include:

Durable state through filesystems and version control, enabling agents to read, write, experiment, and roll back safely.
General-purpose tooling, often via secure bash or code execution, combined with domain-specific skills.
Sandboxes that isolate execution while providing rich defaults for testing and observation.
Memory layers that persist lessons across sessions via structured files and searchable stores.
Context management techniques such as compaction, offloading, and progressive disclosure to maintain reasoning quality.
Enforcement hooks that run deterministically at critical points in the execution cycle.
Observability and self-optimization loops that feed traces back into harness refinements.

These elements converge in mature implementations, from specialized coding agents to broader autonomous systems. The discipline draws on systems engineering traditions while adapting them to the probabilistic nature of large models.

The Emergence of Agent Harness Frameworks

Building effective harnesses from scratch for every project has become impractical. Frameworks now provide modular, battle-tested foundations so developers can focus on domain logic, custom tools, and business-specific policies.

Flue exemplifies this trend with its programmable, low-boilerplate approach in TypeScript. It offers a clean harness model that supports autonomous workflows while maintaining developer control over the full stack.

Our contribution at Superagentic AI is PyFlue, a Python-native port designed for the AI and machine learning ecosystem. PyFlue brings Markdown-driven skills, persistent sessions, policy-gated sandboxing, typed outputs, streaming events, and pluggable backends (including DeepAgents and others). Recent updates have added structured command handling, improved cancellation support, and enhanced client-server modes. It enables teams to adopt proven harness patterns without leaving their preferred language and tooling environment.

These frameworks share common strengths: opinionated yet extensible primitives, emphasis on observability, support for self-correction, and pathways to deployment. They accelerate iteration while promoting consistency. Ablation experiments and production case studies show that gains from refined orchestration, memory, and middleware often exceed those from model upgrades alone.

The community conversation, amplified by recent public discussions, highlights convergence around core patterns even as implementations vary. This maturation signals that harness frameworks are transitioning from experimental tools to foundational infrastructure.

Harness As a Service: Managed Runtimes for Agents

A parallel evolution is underway toward Harness as a Service (HaaS). Instead of assembling orchestration, tool integration, context handling, and safety layers manually, teams configure high-level runtimes that provide these capabilities out of the box.

Major providers are moving from simple completion APIs to full agent runtimes. These services handle loop management, sandboxing, observability, and basic recovery, allowing developers to concentrate on prompt strategy, tool definitions, and evaluation criteria. The shift mirrors the broader transition from infrastructure management to platform consumption, but tailored to agentic workloads.

Benefits include improved scalability, standardized observability, easier multi-agent coordination, and reduced operational burden. Challenges remain—particularly around customization depth, multi-tenancy security, and cost predictability in long-running scenarios. Nevertheless, HaaS represents a logical progression as harness patterns stabilize.

Challenges and Future Directions

Harness engineering is not without open questions. As models improve, harness needs do not vanish; they migrate upward to address more sophisticated failure modes and higher-ambition tasks. Judgment-oriented decisions, multi-agent orchestration protocols, enterprise-grade security and auditing, and dynamic tool assembly all require continued innovation.

Self-optimizing harnesses that analyze their own traces to propose or apply refinements represent a promising frontier. Interoperability standards between harnesses will become increasingly important for fleet-level agent systems. The balance between standardization and specialization will define competitive advantages.

Ultimately, harnesses function as adaptable compilers for agent behavior, translating high-level goals into reliable execution while encoding hard-won lessons from real deployments.

Conclusion

The rise of agent harness frameworks and Harness as a Service marks a pivotal maturation in agentic AI. Differentiation now stems less from raw model selection and more from the quality, adaptability, and insight embedded in the surrounding systems. Organizations that invest thoughtfully in harness engineering will deliver more reliable, maintainable, and powerful autonomous capabilities.

We continue to advance this field through open-source projects such as PyFlue and Meta-Harness initiatives, while fostering community dialogue. If you are exploring these ideas in practice, we invite you to join the conversation.

Upcoming Event on Harness Engineering

We are hosting a dedicated Harness Engineering meetup on June 29 at the AWS Builder Loft in San Francisco. This gathering will feature discussions, hands-on sessions, and practitioner insights on building production-grade agent systems. Register here.

We look forward to advancing this discipline together and welcome contributions, feedback, and collaboration from the broader community.

London Agentic AI x Tessl: Community Partner for AI Native DevCon London 2026

Shashi Jagtap — Fri, 08 May 2026 08:45:16 GMT

Excited to announce that London Agentic AI is officially joining AI Native DevCon London 2026 as a Community Partner.

This partnership is personally meaningful to me because I have been interacting with the Tessl team since they opened the Tech Space in London. I attended the opening ceremony and have stayed connected with the team since then through community events and technical meetups. I have been thrilled to work with Sam and the Tessl team from those early days, and I cannot say enough about how supportive they have been since day one. From helping the community, to offering the venue at short notice, to providing additional support for events we organised, the support from Tessl has always been appreciated.

The First London Agentic AI Event at Tessl

In October 2025, London Agentic AI hosted its first event at Tessl focused on Agentic AI and MCP. The event featured Macey Baker from Tessl together with Guillaume Lebedel, and more than 170 attendees joined the event in person. Event link:

London Agentic AI + MCP Meetup

The energy in the room that evening showed the growing interest in Agentic AI, MCP, and AI engineering.

Continuing the Collaboration in 2026

London Agentic AI also hosted another event at Tessl in March 2026 focused on the Agent Client Protocol ecosystem. The event was sponsored by JetBrains and included speakers from:

JetBrains
Zed
Mistral AI
Vibe Kanban

Event link: Agent Client Protocol Event

It was another strong technical gathering that brought together developers, founders, engineers, and AI practitioners interested in the future of AI-native development and agent ecosystems.

Tessl Becoming a Technical Hub in London

Over the past year, I have personally attended several events hosted at Tessl, and it has been exciting to watch the venue become one of the active hubs for deep technical events in London. The space has hosted a broad range of meetups and industry discussions, bringing together developers, engineers, founders, and experts from across the AI industry. If you are part of the London tech ecosystem, Tessl has become one of the venues that is difficult to ignore for technical AI events and community discussions.

AI Native DevCon London 2026

Now Tessl is hosting AI Native DevCon London 2026, and London Agentic AI is proud to support the conference as a Community Partner.

The event will take place on June 1 to 2, 2026, in London and virtually. The conference themes include some great cutting-edge topics. The conference website also highlights speakers and participants from organisations including OpenAI and Meta, alongside other companies working in AI and AI-native software development.

The event is taking place at The Brewery in London. If you have not signed up yet, I would strongly encourage you to check out the conference and attend. It is shaping up to be an exciting gathering for developers, AI engineers, founders, and practitioners working on AI-native systems and agentic technologies.

Conference link: AI Native DevCon London 2026. Tessl is also providing a discount for members of the community. Get in touch. Looking forward to seeing the community there.

JOIN US

Conference link: AI Native DevCon London 2026
Join London Agentic AI Community on Luma and Meetup

What's New in PyFlue v0.1.3?

Shashi Jagtap — Wed, 06 May 2026 08:20:33 GMT

Thrilled to announce the release of PyFlue v0.1.3. This major update introduces significant enhancements to the framework. PyFlue is the official Python port of the recently launched Flue framework (flueframework.com), which has quickly gained strong interest in the AI agent community for its innovative agent harness approach. PyFlue brings these capabilities to Python developers, enabling the creation of secure, autonomous, and deployable agents with minimal code.

This release adds powerful support for client and server architectures, MCP integration, sandboxed filesystem operations, structured commands, session cancellation, and much more. Below is an in-depth exploration of each new feature, with concise code examples to help you begin using them immediately.

DeepAgents Backend Enhancements

The DeepAgents backend now includes sandbox-backed filesystem tools for secure and isolated file operations. Task delegation supports sophisticated workflow orchestration, and streaming tool events deliver real-time feedback. Provider settings offer finer control over sandbox configuration, while scoped working directories ensure isolation between sessions.

You can configure provider-specific settings directly in your code:

from pyflue import PyFlueAgent

agent = await PyFlueAgent.init(
    model="openai:gpt-4o",
    harness="deepagents",
    sandbox="daytona",
    providers={
        "daytona": {
            "api_key": "your-api-key",
            "base_url": "https://api.daytona.io"
        },
        "e2b": {
            "api_key": "your-e2b-key"
        }
    }
)

Expanded Sandbox Filesystem API

Version 0.1.3 adds a comprehensive set of sandbox filesystem APIs. These include metadata retrieval, existence checks, directory creation and removal, and binary-safe read and write operations. Developers can now build robust tools that interact with the filesystem in a secure, controlled manner.

from pyflue import PyFlueAgent, Sandbox

agent = await PyFlueAgent.init(sandbox="virtual")
sandbox = await agent.create_sandbox()

# Existence checks and metadata retrieval
exists = await sandbox.exists("config.json")
metadata = await sandbox.stat("data.bin")

# Binary-safe reads and writes
data = await sandbox.read_binary("image.png")
await sandbox.write_binary("output.bin", b"\x00\x01\x02\x03")

# Directory creation and removal
await sandbox.mkdir("data/output")
await sandbox.rm("temp/cache")
await sandbox.rmdir("data/old")

Runtime Context Discovery

PyFlue automatically detects and loads contextual information from AGENTS.md, CLAUDE.md, and local skills files in the active sandbox environment. Agents can now adapt dynamically to their surroundings.

from pyflue import PyFlueAgent, load_project_instructions

instructions = await load_project_instructions(root_dir="/path/to/project")

agent = await PyFlueAgent.init(skills_dir="./.agents/skills")

Directory-Style Skill Support

Skills can now be organized in a directory structure under .agents/skills with relative file lookup. This change simplifies management of complex skill collections.

.agents/
├── skills/
│   ├── coding/
│   │   ├── code_review.md
│   │   └── refactor.md
│   ├── docs/
│   │   ├── write.md
│   │   └── update.md
│   └── data/
│       ├── analyze.md
│       └── visualize.md
└── AGENTS.md
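To make the relative lookup concrete, here is a minimal standard-library sketch of how a directory of skills like the one above could be indexed by relative name. It is an illustration only, not PyFlue's loader; the `discover_skills` helper and the `category/name` key format are assumptions.

```python
import tempfile
from pathlib import Path
from typing import Dict


def discover_skills(skills_dir: Path) -> Dict[str, Path]:
    """Map 'category/name' to each Markdown skill file under skills_dir."""
    skills = {}
    for path in sorted(skills_dir.rglob("*.md")):
        relative = path.relative_to(skills_dir).with_suffix("")
        skills[relative.as_posix()] = path
    return skills


# Build a small throwaway tree that mirrors part of the layout above.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / ".agents" / "skills"
    for name in ("coding/code_review.md", "docs/write.md"):
        skill_file = root / name
        skill_file.parent.mkdir(parents=True, exist_ok=True)
        skill_file.write_text("# skill\n")
    print(sorted(discover_skills(root)))
```

Keying skills by their path relative to the skills directory keeps names stable when the project itself is moved or mounted inside a sandbox.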

Typed HTTP Error Envelopes and Webhook Validation

Typed HTTP error envelopes provide structured error responses for HTTP integrations. Webhook request validation has been strengthened for improved reliability and security.
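As a rough illustration of both ideas, the sketch below pairs a typed error envelope with HMAC-based webhook validation using only the standard library. The envelope fields, the `verify_webhook` helper, and the choice of HMAC-SHA256 over the raw body are assumptions made for the example; the release notes do not describe PyFlue's actual scheme.

```python
import hashlib
import hmac
import json
from dataclasses import asdict, dataclass


@dataclass
class ErrorEnvelope:
    """A structured error body that clients can parse reliably."""

    code: str
    message: str
    status: int

    def to_json(self) -> str:
        return json.dumps({"error": asdict(self)})


def verify_webhook(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Check an HMAC-SHA256 signature over the raw request body."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(expected.encode(), signature_hex.encode())


secret = b"demo-secret"                 # hypothetical shared secret
body = b'{"event": "ping"}'
signature = hmac.new(secret, body, hashlib.sha256).hexdigest()

if verify_webhook(secret, body + b" tampered", signature):
    print("accepted")
else:
    print(ErrorEnvelope("invalid_signature", "Signature mismatch", 401).to_json())
```

Returning the same envelope shape for every failure lets a client branch on `code` instead of parsing free-form messages.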

Structured Session History

Session history now includes compaction records, task metadata, child task tracking, and recursive cleanup. These features deliver better visibility and efficient resource management.

Automatic Token-Based Compaction

Automatic compaction triggers before long prompt turns, with a built-in context-overflow recovery retry. Context window limitations are managed gracefully.

from pyflue import PyFlueAgent

agent = await PyFlueAgent.init(
    model="openai:gpt-4o",
    compaction_enabled=True,
    compaction_context_window_tokens=6000,
    compaction_reserve_tokens=500,
    compaction_keep_recent_tokens=2000
)

result = await agent.prompt("Analyze this large codebase and provide insights")
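One plausible reading of how the three token settings interact is sketched below. This is an assumption-laden illustration, not PyFlue's algorithm: the four-characters-per-token estimate, the `compact` helper, and the placeholder summary line are all hypothetical.

```python
from typing import List


def estimate_tokens(text: str) -> int:
    """Crude heuristic: roughly four characters per token (an assumption)."""
    return max(1, len(text) // 4)


def compact(messages: List[str], window: int, reserve: int, keep_recent: int) -> List[str]:
    """Replace older messages with a summary line once the window is too full."""
    total = sum(estimate_tokens(m) for m in messages)
    if total + reserve <= window:
        return list(messages)  # still fits with room reserved for the reply

    kept: List[str] = []
    used = 0
    for message in reversed(messages):  # walk backwards from the newest turn
        cost = estimate_tokens(message)
        if kept and used + cost > keep_recent:
            break
        kept.append(message)
        used += cost
    kept.reverse()

    dropped = len(messages) - len(kept)
    if dropped == 0:
        return list(messages)
    # A real harness would ask a model to summarize; this is a placeholder.
    return [f"[summary of {dropped} earlier messages]"] + kept


# 80 turns of about 100 estimated tokens each, using the settings shown above.
history = [f"turn {i:02d}: ".ljust(400, "x") for i in range(80)]
compacted = compact(history, window=6000, reserve=500, keep_recent=2000)
print(len(history), "->", len(compacted))
```

Under these assumptions, compaction triggers when the history plus the reserve no longer fits in the window, and only the most recent turns within the keep-recent budget survive verbatim.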

MCP Direct Mode and Search/Execute Mode

MCP direct mode enables streamlined integration with the Model Context Protocol. Search/execute mode adds configurable server loading and tool search capabilities, expanding interoperability.

from pyflue import PyFlueAgent
from pyflue.mcp import McpServerConfig, McpMode

# Direct mode
agent = await PyFlueAgent.init(
    mcp_servers={
        "filesystem": {
            "command": "npx",
            "args": ["-y", "@modelcontextprotocol/server-filesystem", "/tmp"]
        }
    },
    mcp_mode="direct"
)

# Search/execute mode
agent = await PyFlueAgent.init(
    mcp_servers={"github": {"url": "https://mcp.github.com"}},
    mcp_mode="search_execute",
    mcp_search_limit=10,
    mcp_search_backend="bm25"
)

Agent-Wide Tools and Commands

Agent-wide tools and command grants can now be shared across agents. The PyFlueCommand class and define_command() function simplify creation of reusable shell or callable tools.

from pyflue.types import PyFlueCommand, define_command

deploy_cmd = define_command(
    "deploy",
    "bash scripts/deploy.sh",
    description="Deploy to production",
    cwd="/app",
    env={"ENV": "production"},
    timeout=300
)

agent = await PyFlueAgent.init(commands=[deploy_cmd])

Session Abort and Cancellation

The new session.abort() method includes active operation tracking and cancellation events. Max task depth limits and parent-to-child cancellation propagation provide fine-grained control.
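Parent-to-child cancellation with a depth limit can be illustrated with plain asyncio. The sketch below does not use PyFlue's API: `run_task`, `MAX_TASK_DEPTH`, and `DepthExceeded` are hypothetical names, and the sleep stands in for real agent work.

```python
import asyncio
from typing import List

MAX_TASK_DEPTH = 2  # hypothetical limit on how deep child tasks may nest


class DepthExceeded(RuntimeError):
    pass


async def run_task(name: str, depth: int, log: List[str]) -> None:
    if depth > MAX_TASK_DEPTH:
        raise DepthExceeded(f"{name} exceeds max task depth {MAX_TASK_DEPTH}")
    child = None
    if depth < MAX_TASK_DEPTH:
        child = asyncio.create_task(run_task(f"{name}.child", depth + 1, log))
    try:
        await asyncio.sleep(60)  # stand-in for long-running agent work
    except asyncio.CancelledError:
        log.append(f"cancelled {name}")
        raise
    finally:
        if child is not None:
            child.cancel()  # propagate cancellation from parent to child
            await asyncio.gather(child, return_exceptions=True)


async def main() -> List[str]:
    log: List[str] = []
    root = asyncio.create_task(run_task("root", 0, log))
    await asyncio.sleep(0.05)  # give the task tree time to start
    root.cancel()              # plays the role of an abort request
    await asyncio.gather(root, return_exceptions=True)
    return log


print(asyncio.run(main()))
```

Cancelling only the root is enough here: each task cancels and awaits its own child in `finally`, so no work is left running after the abort returns.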

Richer Typed Result Extraction

Support has been added for extracting typed results from delimited JSON, raw JSON, fenced JSON, and embedded JSON within free-form text.
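A tolerant extractor of this kind can be sketched with the standard library: try the whole reply as JSON first, then scan for the first decodable object, which also covers fenced and embedded JSON. The `extract_json` and `parse_result` helpers and the `FixResult` dataclass are illustrative assumptions, not PyFlue's implementation (which uses Pydantic models for typing).

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class FixResult:
    fix_applied: bool
    summary: str


def extract_json(text: str) -> Optional[dict]:
    """Return the first JSON object found in text, or None."""
    try:
        value = json.loads(text)  # case 1: the reply is raw JSON
        if isinstance(value, dict):
            return value
    except json.JSONDecodeError:
        pass
    decoder = json.JSONDecoder()
    for index, char in enumerate(text):  # case 2: fenced or embedded JSON
        if char != "{":
            continue
        try:
            value, _ = decoder.raw_decode(text, index)
        except json.JSONDecodeError:
            continue
        if isinstance(value, dict):
            return value
    return None


def parse_result(text: str) -> FixResult:
    """Extract a JSON object and check it against the expected field types."""
    data = extract_json(text)
    if data is None:
        raise ValueError("no JSON object found in model output")
    if not isinstance(data.get("fix_applied"), bool) or not isinstance(data.get("summary"), str):
        raise ValueError("JSON object does not match FixResult")
    return FixResult(fix_applied=data["fix_applied"], summary=data["summary"])


FENCE = "`" * 3  # built at runtime so this example stays a single code block
reply = f"Done.\n{FENCE}json\n" + '{"fix_applied": true, "summary": "patched"}' + f"\n{FENCE}"
print(parse_result(reply))
```

Scanning with `raw_decode` avoids brittle regular expressions for nested objects, at the cost of accepting the first object in the text even if a later one was intended.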

PyFlueClient for Deployed Server Usage

The new PyFlueClient class supports health checks, agent listing, prompt operations, typed parsing, route calls, and SSE streaming for robust client-server architectures.

Build Plugins and Deployment Targets

Build plugins are now available for Uvicorn, Docker, Lambda, Cloud Run, and Cloudflare Containers. These tools simplify deployment across platforms.

pyflue build uvicorn --output dist/
pyflue build docker --tag myapp:latest
pyflue build lambda --handler app.handler

CLI Enhancements

New CLI commands include pyflue routes, improved pyflue dev with .env loading, pyflue invoke, and pyflue status. Truncation messages for large outputs have also been refined.

Documentation and Examples

Comprehensive documentation now covers MCP, configuration, built-in tools, structured commands, cancellation, client usage, deployment, and session behavior. A model-free server and client smoke example is included in examples/server_client/.

All Features of PyFlue

PyFlue is a powerful Python-first framework for building intelligent agents. It provides a complete set of capabilities for sophisticated AI-powered applications.

Core Capabilities

DeepAgents backend with seamless model provider integration
Markdown skill loader for flexible agent definition
SQLite-based session management for persistent history
Virtual sandbox for secure execution
Pydantic typed outputs for type-safe data handling
Typer CLI for intuitive command-line interaction

Integration Features

Multiple sandbox providers (Daytona, E2B, Modal, Runloop)
Optional Monty Python backend for host-side execution
Normalized streaming events with CLI and SSE support
Markdown roles and route triggers
Secret grants with strict command policies
Per-session virtual sandbox persistence

Deployment Options

First-class support for Uvicorn, Docker, Lambda, Cloud Run, and Cloudflare Containers. Build plugins handle packaging and configuration for production environments.

Developer Experience

Full documentation, product-oriented example agents (issue triage, data analysis, coding, support automation), and seamless Model Context Protocol integration.

This release marks a significant milestone, bringing PyFlue to full parity with the Flue vision while adding Python-native strengths. Early community feedback on X highlights strong interest in the Python port, particularly among developers seeking Flue’s harness model without leaving the Python ecosystem.

We encourage you to explore the new features, experiment with the examples, and share your feedback. The repository, documentation, and quickstart guides are available at the links below:

GitHub: https://github.com/SuperagenticAI/pyflue
Documentation: https://superagenticai.github.io/pyflue/
Website: https://super-agentic.ai/pyflue

Thank you for being part of the PyFlue journey. We look forward to seeing what you build.

Introducing PyFlue: The Python-Native Agent Harness Framework Inspired by Flue.

Shashi Jagtap — Sun, 03 May 2026 14:44:43 GMT

When the CEO of HTML, Fred Schott, released Flue, the TypeScript community quickly recognized its significance. A true agent harness framework with Markdown-driven skills, headless and programmable design, zero-config sandboxing, and seamless deployment felt like the missing piece for building autonomous agents.

Python already has a strong ecosystem of agent tooling, arguably an even better one, but everything is DIY. Flue offers a better DX and philosophy, and bakes the entire harness in as a service for agent builders. It is mainly built for the TypeScript ecosystem, but the idea is strong. Usually the TypeScript ecosystem ports the major AI/ML libraries, but this time it is worth porting a great idea from TypeScript into Python. So, here is PyFlue.

PyFlue is now available as the complete Python-native port of Flue. It delivers the same low-boilerplate, Markdown-first developer experience while leveraging Python’s mature ecosystem for typed outputs, secure sandboxes, and flexible deployment. It also adds several production-ready capabilities that go beyond the original vision.

Harness Moment

Flue is not another AI SDK or chat wrapper. It is a framework built around a built-in agent harness. As Fred explained in the launch thread, the harness is the defining characteristic that turns a simple script or chatbot into a true autonomous agent. Many people from the LangGraph and other communities welcomed this idea and appreciated the effort.

TypeScript developers are excited because Flue feels like Claude Code: autonomous, low-code, and powerful, but without the TUI or GUI assumptions. It is headless, programmable, and deployable anywhere. Most logic lives in Markdown files (skills, roles, and AGENTS.md). The framework handles sessions, sub-agents, sandboxing, and deployment so developers can focus on outcomes instead of wiring graphs or loops.

How Flue Differs from Existing Python Tools

The Python ecosystem already has strong agent primitives. LangGraph provides powerful stateful graphs and checkpointers. DeepAgents (LangChain’s higher-level harness) adds planning, filesystem backends, and sub-agents on top of LangGraph. Other options such as CrewAI, PydanticAI, OpenAI Agents SDK, and Google ADK offer structured tools, multi-agent orchestration, and typed outputs.

Flue is different in three key ways that make the framework moment feel fresh:

Framework with strong conventions instead of a flexible library or SDK. Flue ships a complete runtime with first-class concepts such as session, skill, subagent, and built-in sandbox. You write minimal code and run flue run or flue build. LangGraph and even DeepAgents require more assembly to achieve the same harness. Flue is closer to Astro or Next.js than to a React component library.
Markdown-first developer experience with zero-config sandbox. Most agent logic lives in simple .md files with YAML frontmatter for schemas. The virtual sandbox (just-bash style) gives agents immediate filesystem and shell access without extra configuration. Python tools offer similar pieces, but they are not integrated as seamlessly or with the same low boilerplate.
Headless, runtime-agnostic deployment. Flue agents run anywhere (Node.js, Cloudflare, CI/CD pipelines) without baked-in assumptions about human oversight. Python frameworks can achieve this, but Flue makes the entire workflow feel opinionated and delightful from the first line.

In short, Flue raises the bar from “build your own harness” to “use the harness.” PyFlue brings this exact philosophy to Python while adding capabilities that fit the ecosystem naturally.

What PyFlue Is

PyFlue is a Python-first agent harness framework for building autonomous agents. It supports Markdown skills, persistent sessions, sandboxed filesystem and shell access, typed Pydantic outputs, streaming events, file-based webhook routing, and pluggable backends. DeepAgents is the default harness, with support for OpenAI Agents SDK, Google ADK, PydanticAI, and custom implementations.

Agents = Model + Harness + Memory + Secure Sandbox.

Core Features

Markdown Skills and Roles: Define reusable workflows in .agents/skills/*.md with YAML frontmatter for input and output schemas. Global context lives in AGENTS.md. Scoped behavior comes from .agents/roles/*.md files. Apply roles easily:

result = await session.prompt("Review this patch", role="coder")

File-Based Routing and Triggers: Place Python files in the agents/ directory to create webhook routes automatically. Triggers support webhook and schedule metadata for external schedulers.
Stateful Sessions and Tasks: Sessions persist with SQLite by default and support stable IDs for resumption. Tasks create focused child agents with isolated history but a shared sandbox.
Policy-Gated Sandbox: The default virtual sandbox provides controlled access to read, write, edit, grep, glob, and shell commands. Policies enforce safety with allow_write, allow_shell, and command allowlists. Secret grants pass credentials only for specific calls without exposing them to the model. Remote sandboxes (Daytona, E2B, Modal, Runloop) are supported via extras.
Typed Outputs and Streaming: Every skill or prompt accepts a Pydantic model for validation. Streaming works in Python, the CLI, and SSE endpoints.
CLI and Deployment: The pyflue CLI mirrors Flue’s workflow with commands for init, run, dev, build, and deploy across Docker, Railway, Render, Fly.io, Vercel, Netlify, Cloudflare, and CI templates.
Pluggable Harnesses: DeepAgents is the default. The backend registry lets you swap harnesses without changing agent code.
Monty Support: Pydantic Monty is already supported as a Python backend.
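The policy gate described in the list above can be pictured as a small check that runs before any shell command reaches the sandbox. The sketch below borrows the `allow_write` and `allow_shell` names from the text, but the `SandboxPolicy` class, the `command_allowlist` field, and the metacharacter rule are assumptions for illustration rather than PyFlue's implementation.

```python
import shlex
from dataclasses import dataclass
from typing import FrozenSet

# Characters that let one shell string chain or substitute other commands.
SHELL_METACHARACTERS = set(";&|`$<>\n")


@dataclass(frozen=True)
class SandboxPolicy:
    allow_write: bool = False
    allow_shell: bool = False
    command_allowlist: FrozenSet[str] = frozenset()

    def permits_shell(self, command: str) -> bool:
        """Allow a command only if shell access is on and the program is listed."""
        if not self.allow_shell:
            return False
        if SHELL_METACHARACTERS & set(command):
            return False  # refuse chaining, pipes, redirects, and substitution
        try:
            parts = shlex.split(command)
        except ValueError:
            return False  # unbalanced quotes
        return bool(parts) and parts[0] in self.command_allowlist


policy = SandboxPolicy(allow_shell=True, command_allowlist=frozenset({"git", "ls"}))
for command in ("git status --short", "rm -rf /", "git status; rm -rf /"):
    print(command, "->", "allowed" if policy.permits_shell(command) else "denied")
```

Checking only the first word is not enough on its own, which is why the sketch also rejects metacharacters: otherwise an allowed program could be used to smuggle in a second command.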

PyFlue vs Flue

PyFlue preserves Flue’s core strengths while adding Python-specific advantages:

Markdown skills plus roles for enhanced reuse
Virtual sandbox with explicit policies and secret grants
File-based webhook routing with automatic endpoints
Pydantic typed outputs
Extensive deployment targets and CLI tools

Python teams gain deep integration with data and ML libraries without sacrificing Flue’s delightful developer experience.

Quickstart in 30 Seconds

uv add pyflue
pyflue init my-agent
cd my-agent
pyflue run --prompt "Review this project"

A minimal agent looks like this:

import asyncio

from pydantic import BaseModel
from pyflue import init


class FixResult(BaseModel):
    fix_applied: bool
    summary: str


async def main():
    agent = await init(
        model="openai:gpt-5.5",
        harness="deepagents",
        sandbox="virtual",
        allow_write=True,
        allow_shell=True,
    )
    # A stable session ID lets the same session be resumed later.
    session = await agent.session("fix-123")
    # Run the "triage" skill and validate its output against FixResult.
    result = await session.skill(
        "triage",
        args={"issue_number": 123},
        result=FixResult,
    )
    if result.fix_applied:
        await session.shell("git status --short")


asyncio.run(main())

Who Should Use PyFlue

Use PyFlue if you are a Python-first team building coding agents, issue triage systems, data analysis agents, or workflow automation that needs controlled file and shell access. It is ideal for developers who appreciate Flue’s low-boilerplate approach but prefer to stay in Python for its rich libraries and mature tooling.

Get Started Today

PyFlue is open source and ready for use. Visit the GitHub repository or the documentation site to explore the full feature matrix and concept guides.

uv add pyflue
pyflue init my-first-agent

GitHub: https://github.com/SuperagenticAI/pyflue
Docs: https://superagenticai.github.io/pyflue/

We look forward to seeing what you build. Star the repository, contribute skills or roles, or share your agents with the community.

The agent harness moment has arrived for Python. Welcome to PyFlue.

Bringing HALO and Agentic Harness Engineering Concepts Into Our Open Harness Stack

Shashi Jagtap — Sat, 02 May 2026 09:42:11 GMT

Harness engineering is moving from intuition to observability. Recent work around HALO and Agentic Harness Engineering (AHE) points to an important shift in agent development. The next frontier is not only building stronger coding agents. It is building better systems around those agents. The harness decides what the model can see, which tools it can use, how errors are surfaced, how traces are stored, and how improvements are evaluated. When the harness is weak, even strong models can fail in avoidable ways.

At Superagentic AI, we have now implemented the first practical version of these ideas across our open harness stack:

RLM Code v0.1.8
MetaHarness v0.2.2

This is not a clone of HALO or the AHE reference implementation. It is our own implementation of the core ideas inside our existing tools. HALO gives us the trace-analysis loop. AHE gives us the observability structure. RLM Code and MetaHarness now connect both into a working open-source harness optimization workflow.

The Problem: Harnesses Fail Quietly

Coding agents do not fail only because the model is weak. They often fail because the harness around the model is brittle.

Common harness-level failures include:

Hallucinated tool calls
Confusing or redundant tool arguments
Missing feedback after failed commands
Refusal loops
Weak retry behavior
Poor task decomposition
Bad context compaction
Unclear system prompts
Middleware that hides useful state
Evaluation signals that do not explain why a candidate improved or regressed

These are not only model-training problems. They are engineering problems. The challenge is that they are hard to debug manually. A serious benchmark run can produce thousands or millions of trace tokens. Reading raw logs by hand does not scale. Making random prompt edits is not enough either, because it does not reliably explain which change caused which outcome. This is where HALO and Agentic Harness Engineering become useful.

What HALO Showed

HALO, or Hierarchical Agent Loop Optimizer, showed that execution traces can be used as direct evidence for harness improvement.

The basic loop is simple:

collect traces
    -> analyze repeated failure modes
    -> produce a report
    -> feed that report to a coding agent
    -> update the harness
    -> evaluate again

The important part is that the optimizer is not guessing. It is grounded in what the agent actually did. If traces show hallucinated tool calls, the harness can add stricter tool-selection guidance. If traces show redundant tool arguments, the tool schema can be simplified. If traces show refusal loops, the system prompt or middleware can be changed. HALO made the trace-to-harness-improvement loop explicit.
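The "analyze repeated failure modes, then produce a report" steps can be sketched in a few lines. This is not HALO's implementation: the span fields (`status`, `error_type`) and the `failure_report` helper are hypothetical, chosen only to show how a report can be grounded in counts taken from traces.

```python
import json
from collections import Counter


def failure_report(jsonl_text: str, min_count: int = 2) -> str:
    """Summarize error types that repeat across spans in a JSONL trace dump."""
    counts: Counter = Counter()
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        span = json.loads(line)
        if span.get("status") == "error":
            counts[span.get("error_type", "unknown")] += 1
    lines = ["Repeated failure modes:"]
    for kind, seen in counts.most_common():
        if seen >= min_count:  # one-off errors are treated as noise here
            lines.append(f"- {kind}: {seen} spans")
    return "\n".join(lines)


# Hypothetical trace data for the example.
traces = "\n".join([
    '{"trace_id": "t1", "status": "error", "error_type": "hallucinated_tool_call"}',
    '{"trace_id": "t2", "status": "error", "error_type": "hallucinated_tool_call"}',
    '{"trace_id": "t3", "status": "error", "error_type": "refusal_loop"}',
    '{"trace_id": "t4", "status": "ok"}',
])
print(failure_report(traces))
```

The resulting text is what would be handed to a coding agent in the next step of the loop, so every proposed harness change can point back to a counted failure pattern.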

What Agentic Harness Engineering Added

The Agentic Harness Engineering paper adds a stronger structure around the same idea. AHE frames harness optimization around three observability layers:

Component observability: The harness should be decomposed into editable pieces such as prompts, tools, tool descriptions, middleware, skills, memory, sub-agents, evaluators, and orchestration.
Experience observability: Raw traces should be distilled into layered evidence, including overview reports, per-task or per-trace detail files, and raw traces for drill-down.
Decision observability: Every harness edit should come with a change manifest that explains what changed, which evidence motivated it, what it is expected to fix, and what might regress.

This is the key shift. Prompt optimization says:

try a better prompt and see if the score improves

AHE-style harness optimization says:

make an evidence-backed change
  record the prediction
  evaluate the result
  attribute the outcome
  keep, discard, or redesign based on evidence

That is a more serious engineering loop. It makes harness improvement observable, auditable, and falsifiable.

What Superagentic AI Built

Superagentic AI has implemented these concepts across two open-source repositories, RLM Code and MetaHarness.

Together, they now support a practical workflow for trace-grounded, evidence-backed harness optimization.

RLM Code v0.1.8: Experience Observability

RLM Code now handles the trace-analysis side of the loop. We added a HALO and AHE-inspired trace_analysis environment that works over OpenTelemetry-shaped JSONL traces.

It supports:

Sidecar trace indexing
Dataset overview
Trace querying
Trace counting
Trace search
Full trace viewing for small traces
Selected span viewing for large traces
Bounded payload handling
Layered evidence corpus export

The new core action is:

export_evidence_corpus

This writes a structured evidence bundle:

trace-evidence/
    overview.md
    index.json
    detail/
      trace-error-1.md
      trace-error-2.md
    raw/
      trace-error-1.jsonl
      trace-error-2.jsonl

The output is designed for another coding agent or MetaHarness to consume.

overview.md gives the high-level diagnosis. detail/*.md gives per-trace evidence. raw/*.jsonl keeps processed span data available for drill-down. index.json makes the corpus machine-readable. This implements the AHE idea of experience observability. Traces are not dumped into a prompt blindly. They become a layered evidence corpus.
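To show what "layered" means in practice, the sketch below writes the same four layers from a list of spans. It is an illustration under assumptions, not RLM Code's exporter: the span fields (`trace_id`, `name`, `status`, `message`) and the `export_corpus` helper are hypothetical, and trace IDs are assumed to be safe to use as file names.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def export_corpus(spans: List[dict], output_dir: Path) -> dict:
    """Write overview.md, index.json, detail/*.md and raw/*.jsonl for failing traces."""
    by_trace: Dict[str, List[dict]] = {}
    for span in spans:
        by_trace.setdefault(span["trace_id"], []).append(span)

    (output_dir / "detail").mkdir(parents=True, exist_ok=True)
    (output_dir / "raw").mkdir(parents=True, exist_ok=True)

    index = {"traces": []}
    for trace_id, trace_spans in sorted(by_trace.items()):
        errors = [s for s in trace_spans if s.get("status") == "error"]
        if not errors:
            continue  # this sketch exports failing traces only
        detail = [f"# {trace_id}", ""]
        detail += [f"- {s['name']}: {s.get('message', '')}" for s in errors]
        (output_dir / "detail" / f"{trace_id}.md").write_text("\n".join(detail) + "\n")
        raw = "".join(json.dumps(s) + "\n" for s in trace_spans)
        (output_dir / "raw" / f"{trace_id}.jsonl").write_text(raw)
        index["traces"].append({"trace_id": trace_id, "error_spans": len(errors)})

    overview = ["# Trace evidence overview", ""]
    overview += [f"- {t['trace_id']}: {t['error_spans']} error span(s)" for t in index["traces"]]
    (output_dir / "overview.md").write_text("\n".join(overview) + "\n")
    (output_dir / "index.json").write_text(json.dumps(index, indent=2))
    return index


spans = [
    {"trace_id": "trace-error-1", "name": "tool.search", "status": "error", "message": "unknown tool"},
    {"trace_id": "trace-error-1", "name": "llm.call", "status": "ok"},
    {"trace_id": "trace-ok-1", "name": "llm.call", "status": "ok"},
]
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "trace-evidence"
    export_corpus(spans, out)
    print(sorted(p.relative_to(out).as_posix() for p in out.rglob("*") if p.is_file()))
```

A reader, human or agent, can start from the short overview and open a detail or raw file only for the traces that matter, which keeps prompt size bounded.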

Example workflow:

/rlm run "Find systemic harness failures trace=./traces.jsonl" env=trace_analysis steps=6

Then export the evidence corpus:

{
    "action": "export_evidence_corpus",
    "output_dir": "./trace-evidence",
    "filters": {"has_errors": true},
    "limit": 100,
    "include_raw": true
}

The resulting trace-evidence/overview.md can then be passed into MetaHarness.

MetaHarness v0.2.2: Decision Observability

MetaHarness now handles the harness evolution side. We added support for trace-grounded candidate generation and AHE-style change manifests. MetaHarness already creates candidate workspaces, runs proposal backends such as Codex or Gemini, validates changes, evaluates candidates, and stores run artifacts.

Now it also supports:

--trace-evidence
Candidate evidence injection
.metaharness/evidence/trace_evidence.md
.metaharness/change_manifest.json
Archived proposal manifests
Optional task-level change attribution
Ledger fields for changed components and verdicts

A MetaHarness run can now receive trace evidence like this:

uv run metaharness run ./my-harness \
    --backend codex \
    --hosted \
    --trace-evidence ./trace-evidence/overview.md \
    --budget 1

Each candidate workspace receives:

.metaharness/evidence/trace_evidence.md

The proposer prompt explicitly tells the coding agent to use that evidence when making harness changes. Before finishing, the proposer is now instructed to write:

.metaharness/change_manifest.json

A manifest looks like this:

{
  "schema_version": "metaharness.change_manifest.v1",
  "candidate_id": "c0001",
  "parent_candidate_ids": ["c0000"],
  "changes": [
    {
      "id": "change-1",
      "component": "tool_description",
      "description": "Clarified the tool argument contract.",
      "files": ["tools/search.py"],
      "failure_pattern": "The agent repeatedly passed redundant arguments.",
      "evidence_refs": ["trace_evidence.md#trace-error-1"],
      "root_cause": "The tool schema allowed ambiguous argument combinations.",
      "targeted_fix": "Make the required argument explicit and remove redundant alternatives.",
      "predicted_fixes": ["task-search-001"],
      "risk_tasks": ["task-search-legacy"],
      "notes": "Should reduce malformed tool calls."
    }
  ]
}
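A manifest like the one above is only useful if it is complete enough to check later. The sketch below shows one way a team might lint a manifest before accepting a candidate. The choice of which fields are required, and the rule that a task cannot be both a predicted fix and a risk, are assumptions made for illustration. This is not MetaHarness's own validation logic.

```python
# Assumed required fields; the real schema may require more or fewer.
REQUIRED_TOP = {"schema_version", "candidate_id", "parent_candidate_ids", "changes"}
REQUIRED_CHANGE = {"id", "component", "description", "files", "predicted_fixes", "risk_tasks"}


def check_manifest(manifest):
    """Return a list of problems found in a change manifest (empty means OK)."""
    problems = [
        f"missing top-level field: {key}"
        for key in sorted(REQUIRED_TOP - set(manifest))
    ]
    for i, change in enumerate(manifest.get("changes", [])):
        for key in sorted(REQUIRED_CHANGE - set(change)):
            problems.append(f"changes[{i}] missing field: {key}")
        # A task listed as both a predicted fix and a risk cannot be falsified.
        overlap = set(change.get("predicted_fixes", [])) & set(change.get("risk_tasks", []))
        if overlap:
            problems.append(f"changes[{i}] lists tasks as both fix and risk: {sorted(overlap)}")
    return problems


manifest = {
    "schema_version": "metaharness.change_manifest.v1",
    "candidate_id": "c0001",
    "parent_candidate_ids": ["c0000"],
    "changes": [{
        "id": "change-1",
        "component": "tool_description",
        "description": "Clarified the tool argument contract.",
        "files": ["tools/search.py"],
        "predicted_fixes": ["task-search-001"],
        "risk_tasks": ["task-search-legacy"],
    }],
}
broken = {
    "candidate_id": "c0002",
    "changes": [{"id": "change-1", "predicted_fixes": ["t1"], "risk_tasks": ["t1"]}],
}

print(check_manifest(manifest))
for problem in check_manifest(broken):
    print(problem)
```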

MetaHarness archives this under:

candidates//proposal/change_manifest.json

If the evaluator returns task-level results, MetaHarness can also generate:

candidates//proposal/change_attribution.json

That attribution report compares predicted fixes, risk tasks, actual improvements, and regressions. Verdicts include:

EFFECTIVE
PARTIALLY_EFFECTIVE
MIXED
INEFFECTIVE
HARMFUL

This gives MetaHarness a real decision ledger. The point is not just to find the winning candidate. The point is to understand why a change worked or failed.
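To make the comparison between predictions and outcomes concrete, here is a minimal sketch of how a verdict could be derived from predicted fixes, risk tasks, actual improvements, and regressions. The verdict names come from the list above, but the decision rules in attribute are assumptions chosen for illustration. MetaHarness's actual attribution logic may differ.

```python
def attribute(predicted_fixes, risk_tasks, improved, regressed):
    """Compare a change's predictions with task-level results (assumed rules)."""
    predicted, risk = set(predicted_fixes), set(risk_tasks)
    improved, regressed = set(improved), set(regressed)
    hits = predicted & improved  # predictions confirmed by the evaluation

    if regressed:
        # Any regression rules out a clean verdict.
        verdict = "MIXED" if hits else "HARMFUL"
    elif not hits:
        verdict = "INEFFECTIVE"
    elif hits == predicted:
        verdict = "EFFECTIVE"
    else:
        verdict = "PARTIALLY_EFFECTIVE"

    return {
        "verdict": verdict,
        "confirmed_fixes": sorted(hits),
        "missed_fixes": sorted(predicted - improved),
        "realized_risks": sorted(risk & regressed),
        "unpredicted_regressions": sorted(regressed - risk),
    }


# Task A was predicted to improve and did; risk task B stayed stable.
report = attribute(["task-a"], ["task-b"], improved=["task-a"], regressed=[])
print(report["verdict"])
```

Keeping missed fixes and unpredicted regressions as separate fields means the ledger records not only the verdict but also where the proposer's reasoning was wrong.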

The New Superagentic Harness Loop

Together, RLM Code and MetaHarness now support this loop:

agent execution traces
    -> RLM Code trace_analysis
    -> layered evidence corpus
    -> MetaHarness --trace-evidence
    -> candidate harness edits
    -> change_manifest.json
    -> validation and evaluation
    -> change_attribution.json
    -> candidate ledger
    -> keep, discard, or redesign

This is the practical implementation of HALO and AHE concepts in the Superagentic AI stack. HALO-style trace analysis gives us the evidence. AHE-style manifests make the decisions auditable. MetaHarness turns that into an optimization workflow.

Why?

This moves harness optimization away from guesswork. Instead of saying:

The new prompt seems better.

We can say:

This candidate changed the tool description because traces showed malformed tool calls.
It predicted task A would improve and task B might regress.
After evaluation, task A improved and task B stayed stable.
The change is EFFECTIVE.

That is a stronger engineering primitive. It makes harness improvement:

Observable
Reproducible
Auditable
Benchmarkable
Falsifiable

This is especially important for coding agents because the harness is now a large part of the product.

The model is only one part of the system. The real agent experience comes from the model plus tools, memory, middleware, prompts, execution environment, tracing, evaluation, and rollback behavior.

What We Did Not Do

We did not copy the HALO repository. We did not copy the AHE reference implementation. We did not tie our stack to NexAU, Harbor, E2B, or any one benchmark runtime.

Instead, we implemented the transferable ideas:

Trace analysis
Layered evidence
Evidence injection
Change manifests
Task-level attribution
Candidate ledgers

This keeps the Superagentic AI stack modular.

RLM Code remains the recursive trace-analysis and experimentation layer. MetaHarness remains the optimization and candidate-evaluation layer. The two now work together.

What Comes Next

This is the first practical version of the loop.

Next, we plan to improve the system in several directions:

Richer task-level grouping in RLM Code
Automatic MetaHarness ingestion of the full evidence corpus, not only overview.md
Better component-level rollback suggestions
Stronger OpenTelemetry compatibility
Multi-run attribution across candidates
Benchmark reports showing before and after gains
Tighter integration between trace spans and candidate manifests

The long-term direction is clear:

self-improving harnesses need observability first

Before an agent can improve its own harness, it needs to see what happened, understand what changed, and verify whether the change worked. That is what we have started building.

Summary

Superagentic AI has now implemented HALO- and AHE-inspired observability loops across its open harness stack. In RLM Code v0.1.8, we added layered trace evidence export. In MetaHarness v0.2.2, we added trace-grounded candidate proposals, change manifests, task-level attribution, and candidate ledger support.

Together, they create a practical open-source workflow for observable harness optimization:

traces -> evidence -> harness edits -> manifest predictions -> eval -> attribution

This is the beginning of falsifiable harness engineering.

Agentic Harness Engineering: The Next Frontier After Harness Engineering

Shashi Jagtap — Thu, 30 Apr 2026 15:11:44 GMT

Harness Engineering has become a major trend in AI, and it has now reached the next level. In our earlier post, Harness Engineering: The Hottest Topic in AI Agent Engineering, we covered how the model is no longer the whole story. Agent performance is shaped by the system around it: tools, memory, instructions, constraints, execution runtime, evaluation, and feedback loops.

The new paper, Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses, released on April 29, pushes that argument forward. Its central claim is that the bottleneck in self-improving coding-agent harnesses is not only agent capability but also observability. In other words, coding agents cannot reliably improve their own harnesses if the harness is a black box, the action space is unclear, the execution traces are unstructured, and edits are not tied to measurable predictions. Once the system exposes the right surfaces, a separate evolution agent can begin to improve the harness itself. This is the jump from Harness Engineering to Agentic Harness Engineering. Harness Engineering asks how humans design better environments for agents. Agentic Harness Engineering asks how we build environments that agents can inspect, revise, test, and improve over time.

What The AHE Paper Introduces: Probably GEPA for Coding-Agent Harnesses

The paper introduces Agentic Harness Engineering, or AHE, as a closed-loop system for evolving coding-agent harnesses. It holds the base model fixed and lets an evolution loop edit the harness around that model. That distinction matters. AHE is not a new model. It is not just prompt optimization or a one-off benchmark trick. It is better understood as a framework for modifying the external system that controls how a coding agent sees the repository, calls tools, uses memory, manages risk, and reacts to execution results. The authors frame the hard part as a three-part observability problem:

Component observability
Experience observability
Decision observability

Together, these make harness evolution inspectable enough for another agent to operate on.

Pillar 1: Component Observability

Most agent harnesses are tangled. The system prompt, tool descriptions, tool implementations, middleware, memory, skills, and sub-agent routing often live as one inseparable blob of code and configuration. That makes automated improvement difficult. If a run fails, the optimizer cannot tell whether the problem came from the prompt, a missing tool affordance, an unsafe middleware policy, an unhelpful memory entry, or an execution-time guard. AHE addresses this by representing editable harness components as file-level artifacts. The paper’s NexAU substrate exposes seven component types:

System prompt
Tool description
Tool implementation
Middleware
Skill
Sub-agent configuration
Long-term memory

This turns the harness into an explicit action space. An evolution agent can edit a specific component, inspect the diff, test the result, and roll it back if needed. This is one of the most important engineering ideas in the paper. Self-improvement does not start with autonomy. It starts with decomposition. If the harness cannot be decomposed, the optimizer cannot assign credit.
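A small sketch can show why file-level decomposition makes credit assignment tractable. Below, each changed file is mapped to one of the seven component types listed above, so a diff can be summarized per component. The directory names in COMPONENT_BY_DIR and the example file paths are hypothetical conventions invented for this example. They are not the layout used by NexAU or the paper.

```python
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical directory conventions; a real harness would define its own.
COMPONENT_BY_DIR = {
    "prompts": "system_prompt",
    "tool_descriptions": "tool_description",
    "tools": "tool_implementation",
    "middleware": "middleware",
    "skills": "skill",
    "subagents": "sub_agent_configuration",
    "memory": "long_term_memory",
}


def component_of(path):
    """Map a changed file to a harness component by its top-level directory."""
    parts = PurePosixPath(path).parts
    if not parts:
        return "unknown"
    return COMPONENT_BY_DIR.get(parts[0], "unknown")


def summarize_diff(changed_files):
    """Count changed files per component so outcomes can be credited per component."""
    return Counter(component_of(f) for f in changed_files)


diff = ["tools/shell.py", "tools/edit.py", "prompts/system.md", "README.md"]
summary = summarize_diff(diff)
print(dict(summary))
```

If an edit touches files that map to "unknown", that is a useful signal in itself: part of the harness is still outside the explicit action space.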

Pillar 2: Experience Observability

Coding-agent runs produce enormous traces. A long-horizon coding task may include shell commands, file edits, test failures, environment assumptions, partial fixes, retries, tool errors, and final verification attempts. Raw traces are not enough. They contain signal, but they are too large and too messy for a future optimization step to consume directly.

AHE introduces an Agent Debugger role that distills rollout trajectories into a layered evidence corpus. Instead of asking the evolution agent to read millions of raw tokens, the system produces structured reports that support drill-down:

Task-level outcomes
Success and failure patterns
Root-cause analysis
Benchmark-level summaries
Evidence linked back to the underlying trajectory

This is the difference between logging and observability. Logging says: here is everything that happened. Observability says: here is the structure that helps another actor understand why it happened. For agent harnesses, this matters because the next improvement may depend on a repeated pattern. Maybe the agent keeps destroying a verified state after a success. Maybe it repeatedly guesses dependency setup instead of inspecting the project. Maybe it wastes turns rediscovering the same repository structure. Maybe it misses that a shell command sequence creates cross-step risk. Those are harness problems, but they only become visible when the trajectory is distilled into a form that can be acted on.
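The idea of surfacing repeated patterns can be sketched in a few lines. The example below assumes failure events have already been labeled with a pattern by some upstream analysis step. That labeling is the hard part and is not shown here. The event fields (task_id, step, pattern) and the min_tasks threshold are hypothetical. This is not the paper's Agent Debugger.

```python
from collections import defaultdict


def distill(events, min_tasks=2):
    """Keep failure patterns that recur across tasks, with links back to evidence."""
    grouped = defaultdict(list)
    for event in events:
        grouped[event["pattern"]].append(event)

    report = []
    for pattern, items in grouped.items():
        tasks = sorted({e["task_id"] for e in items})
        if len(tasks) >= min_tasks:
            report.append({
                "pattern": pattern,
                "tasks": tasks,
                # Pointers back into the underlying trajectory for drill-down.
                "evidence": [f'{e["task_id"]}#step-{e["step"]}' for e in items],
            })
    # Most widespread patterns first; ties broken alphabetically for stable output.
    return sorted(report, key=lambda r: (-len(r["tasks"]), r["pattern"]))


events = [
    {"task_id": "task-1", "step": 4, "pattern": "guessed dependency setup"},
    {"task_id": "task-2", "step": 7, "pattern": "guessed dependency setup"},
    {"task_id": "task-2", "step": 9, "pattern": "overwrote verified state"},
    {"task_id": "task-3", "step": 2, "pattern": "guessed dependency setup"},
]
findings = distill(events)
for finding in findings:
    print(finding["pattern"], finding["tasks"])
```

A one-off failure is dropped from the report, while a pattern seen in several tasks is kept together with references to the steps where it occurred.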

Pillar 3: Decision Observability

The most interesting part of AHE is not only that it edits the harness. It requires edits to make predictions. Each change is paired with a self-declared expectation about what should improve in the next round. Later, the system checks that prediction against task-level outcomes. That turns a harness edit into a falsifiable contract. This is a major step toward making agentic self-improvement more scientific. Without decision observability, an evolution loop can become a pile of trial-and-error patches. With decision observability, every edit has a reason, an expected effect, and a later verification point. For teams building production coding-agent systems, this is the habit worth adopting immediately: do not only record what changed. Record why it changed, what the change predicted, and whether the next run confirmed it.

The Results: Why This Paper Is Special

The paper reports a 10-iteration AHE campaign on Terminal-Bench 2, starting from a minimal bash-only seed harness called NexAU0. The paper’s component ablation is especially revealing:

Swapping in AHE’s long-term memory alone improves the seed by +5.6 percentage points
Swapping in AHE’s tools alone improves the seed by +3.3 percentage points
Swapping in AHE’s middleware alone improves the seed by +2.2 percentage points
Swapping in AHE’s system prompt alone regresses the seed by 2.3 percentage points

That last result should be uncomfortable for anyone still treating agent improvement as mostly prompt work. The gain did not live in the prose alone. It lived in the executable and operational structure around the model.

Transfer Matters More Than The Leaderboard

The paper also reports transfer experiments. The evolved AHE harness was frozen and moved to SWE-bench Verified without re-evolution. It achieved the highest aggregate success rate in the reported comparison while using fewer tokens than the seed. The paper reports 75.6% aggregate success for AHE versus 75.2% for NexAU0, with token usage dropping from 526k to 461k. The cross-model story is also notable. When the AHE harness was evaluated with alternate model families, the paper reports gains ranging from +2.3 to +10.1 percentage points. The largest gains appeared on weaker or less saturated base models. That suggests the evolved harness is not only memorizing a benchmark. It is encoding reusable coordination patterns: how to protect verified state, how to manage shell risk, how to structure tool use, and how to make progress through long-horizon software tasks. This is the real promise of Agentic Harness Engineering. If the harness captures reusable engineering experience, then harness evolution becomes a way to improve many agents, not just one model on one benchmark.
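The transfer numbers quoted above are easier to read side by side as a success delta and a relative token saving. The short calculation below uses only the figures stated in the paragraph. The variable names are ours, and the token counts are taken as the rounded values reported (526k and 461k).

```python
seed_success, ahe_success = 75.2, 75.6      # aggregate success, percent
seed_tokens, ahe_tokens = 526_000, 461_000  # rounded token usage as quoted

success_delta = ahe_success - seed_success
token_saving = (seed_tokens - ahe_tokens) / seed_tokens

print(f"success delta: {success_delta:+.1f} percentage points")
print(f"token reduction: {token_saving:.1%}")
```

In other words, the success rate on the transfer benchmark moves by less than half a point while token usage falls by roughly 12 percent, so the efficiency gain is the more visible part of the transfer result.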

How This Connects To Meta-Harness

AHE arrives only weeks after the Meta-Harness paper, which introduced an outer-loop system for optimizing model harness code. Meta-Harness argued that harness search needs richer access to prior experience than classic text optimizers provide. Instead of compressing feedback into tiny summaries or scalar scores, Meta-Harness gives an agentic proposer filesystem access to prior candidate code, traces, and scores. That idea is directly aligned with what we have been building in the open source Superagentic MetaHarness library. MetaHarness is a filesystem-first Python library for optimizing executable harnesses around agentic coding systems. It treats files like AGENTS.md, GEMINI.md, bootstrap scripts, validation scripts, test flows, routing logic, and benchmark glue as optimization targets. The current OSS implementation already reflects several principles that AHE now makes even sharper.

Component observability

Why it matters: the editable harness surface must be explicit and revertible. MetaHarness alignment today: candidate workspaces, diffs, allowed_write_paths, instruction files, scripts, and helper code are stored as concrete files. Next opportunity: add a clearer component taxonomy for prompts, tools, middleware, memory, skills, and runtime policies.

Experience observability

Why it matters: future optimization needs structured evidence, not raw chaos. MetaHarness alignment today: runs store manifests, validation results, evaluation results, diffs, bootstrap snapshots, parent artifacts, ledgers, and compare outputs. Next opportunity: add richer trajectory distillation and root-cause summaries inspired by AHE’s Agent Debugger.

Decision observability

Why it matters: every edit should declare a prediction and be checked later. MetaHarness alignment today: MetaHarness already records outcomes such as keep, discard, crash, timeout, no-change, and scope-violation. Next opportunity: add per-edit prediction manifests and next-round verification reports.

This makes AHE less like a distant academic result and more like a product roadmap for practical harness optimization. Meta-Harness showed that harness code should be optimized as code. AHE shows that self-optimizing harnesses need observability across components, experience, and decisions.

Superagentic MetaHarness sits naturally between those ideas: a practical OSS workbench where harnesses are stored, mutated, evaluated, compared, and made inspectable.

Where HALO Fits

HALO, or Hierarchical Agent Loop Optimization, is highly related to this same movement.

HALO is an RLM-based method for recursively improving agent harnesses from execution traces. Its loop is:

Collect traces from an agent harness, using OpenTelemetry-compatible tracing
Feed those traces into the HALO-RLM engine
Have the RLM decompose the traces and identify common failure modes
Feed the findings into a coding agent such as Cursor, Claude Code, or a similar code-editing agent
Redeploy the harness, gather more traces, and repeat

The reported AppWorld results are strong: the HALO README reports Sonnet 4.6 moving from 73.7% to 89.5% on dev SGC, and Gemini 3 Flash moving from 36.8% to 52.6%. On the held-out test_normal proxy, the README reports +10.7 point gains for both models. Conceptually, HALO is very close to AHE’s experience-observability pillar. Both systems are based on the same core intuition: the most valuable harness improvements often come from analyzing the agent’s own execution traces, finding repeated systemic failures, and converting those failures into harness edits. The difference is mostly where each system places the strongest abstraction.

HALO

Primary abstraction: RLM over trace data. What it optimizes: harness failures surfaced from OpenTelemetry-style traces. Current shape: trace analysis report, then coding-agent-driven harness updates.

AHE

Primary abstraction: observable self-evolution loop. What it optimizes: system prompt, tools, middleware, memory, skills, and other harness components. Current shape: component, experience, and decision observability in one closed loop.

Meta-Harness

Primary abstraction: filesystem-backed harness search. What it optimizes: executable harness code and candidate workspaces. Current shape: agentic proposer reads prior code, traces, scores, and artifacts.

Superagentic MetaHarness

Primary abstraction: practical OSS outer-loop optimizer. What it optimizes: instruction files, scripts, validation flows, routing logic, and benchmark glue. Current shape: candidate workspaces, diffs, manifests, ledgers, outcomes, and run comparison.

So yes: HALO is part of the same concept family. It is best understood as an operationally useful trace-to-feedback layer for harness optimization. AHE is a broader research framework that also insists on explicit editable components and prediction-linked decisions. Meta-Harness and Superagentic MetaHarness provide the filesystem-backed optimization environment where those ideas can become reusable engineering workflows.

The convergence is the important story. Meta-Harness, AHE, and HALO all point in the same direction: future agent improvement will come from trace-aware, evidence-driven harness evolution, not just bigger models or better one-shot prompts.

Where RLM Code Fits

RLM Code is also relevant to this direction. RLM Code provides an interactive environment for running LLM-powered agents in a REPL loop, benchmarking them, comparing results, replaying trajectories, and inspecting observability data. It also includes a coding-agent harness path through /harness run, plus a CodeMode strategy with MCP tool discovery, guarded code execution, and side-by-side benchmark telemetry. The upcoming RLM Code release also adds a HALO-style trace_analysis environment. It can read OpenTelemetry-shaped JSONL traces, build a sidecar index, query trace summaries, search individual traces, view selected spans, and produce an evidence report without blindly loading huge traces into context. That makes RLM Code a useful lab for the experience-observability side of Agentic Harness Engineering.

In practical terms:

MetaHarness is the outer-loop optimizer for executable harness artifacts
RLM Code is a runtime and research environment for agent execution, trajectory capture, replay, comparison, and HALO-style trace diagnosis
Superagentic MetaHarness can consume the resulting trace report through --trace-evidence, copying it into each candidate workspace and embedding it in the proposer prompt
Together, they point toward an end-to-end workflow: run agents, collect trajectories, distill failures, evolve harness files, verify changes, and keep a durable evidence trail

A typical workflow looks like this:

/rlm run "Find systemic harness failures trace=./traces.jsonl" env=trace_analysis steps=6

uv run metaharness run examples/python_fixture_benchmark \
    --backend codex \
    --hosted \
    --trace-evidence ./trace_evidence.md \
    --budget 1 \
    --run-name trace-grounded-codex

That is the shape of serious agent engineering: traces become evidence, evidence becomes candidate changes, candidate changes become evaluated artifacts, and the whole process stays inspectable.

Bigger Than Coding Benchmarks

The AHE paper focuses on coding agents, and that is the right place to start. Coding agents expose concrete artifacts, deterministic tests, shell traces, repository state, and measurable pass or fail outcomes. But the pattern generalizes. Any production agent harness has components:

Tools
Memory
Retrieval
Runtime policies
Permissions
Handoff rules
Validation checks
Retry strategies
Evaluator feedback
User-facing state

Any production agent also produces experience:

Task traces
Tool errors
Failed plans
Unsafe actions
Repeated corrections
Unresolved edge cases
Successful patterns

And any mature team needs decision observability:

What changed?
Who or what changed it?
Why was it changed?
What was expected to improve?
Did it actually improve?
Should we keep, revise, or revert?

This is why Harness Engineering is becoming a core discipline within Agent Engineering. Agents are not only prompted. They are operated, evaluated, and constrained. They are given tools and memory. They are routed through workflows. They need rollback, telemetry, and accountability. AHE gives the field a concrete research vocabulary for that reality.

The Caution: AHE Is Powerful, But Not Magic

The paper is careful about limitations, and builders should be too. AHE is still a controlled research prototype. It adds engineering and compute overhead because each iteration requires benchmark execution, trajectory analysis, and workspace management. It expands the adaptation surface, which creates more opportunity for benchmark-specific tuning. It includes governance mechanisms such as bounded edits, attribution, and rollback, but it is not a complete guardrail stack. Autonomous harness evolution should not mean “let the agent rewrite everything and hope.” It should mean:

Bounded editable surfaces
Explicit component ownership
Structured trajectory evidence
Cheap validation before expensive evaluation
Candidate ledgers
Rollback paths
Prediction manifests
Human-reviewable diffs
Repeatable benchmark runs

The future is not uncontrolled self-modifying agents. The future is observable, testable, reviewable self-improving harnesses.

What Builders Should Do Now

If you are building coding agents, this paper suggests a practical checklist.

Make your harness file-addressable. Prompts, tool definitions, middleware policies, memory seeds, skills, and runtime configs should be inspectable artifacts, not hidden constants.
Store candidate history. Keep the code, diffs, evaluation scores, validation output, and execution traces for every attempted improvement.
Add outcome discipline. Classify candidates as kept, discarded, timed out, crashed, unchanged, or out of scope.
Turn logs into evidence. Raw traces are useful, but future optimization needs summaries that preserve root causes and allow drill-down.
Require change predictions. Every proposed harness edit should say what it expects to improve. Later runs should check that claim.
Separate cheap validation from expensive evaluation. Do not spend full benchmark cycles on candidates that fail basic syntax, scope, or safety checks.
Treat tools, middleware, and memory as first-class optimization targets. The AHE ablation results are a warning: the strongest improvements may not live in the system prompt.
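The checklist item about separating cheap validation from expensive evaluation can be sketched as a simple gate. The example below rejects candidates that change nothing, write outside an allowed path list, or contain Python that does not parse, and only then calls the evaluator. The labels no-change and scope-violation mirror outcome names mentioned earlier in this post. The labels invalid and evaluated, the function names, and the dict-of-files candidate format are hypothetical. This is not how MetaHarness implements validation.

```python
import ast
from pathlib import PurePosixPath


def in_scope(path, allowed_write_paths):
    """True if path sits under one of the allowed roots (purely lexical check)."""
    parts = PurePosixPath(path).parts
    for root in allowed_write_paths:
        root_parts = PurePosixPath(root).parts
        if parts[:len(root_parts)] == root_parts:
            return True
    return False


def cheap_checks(changed_files, allowed_write_paths):
    """Return an outcome label when a cheap check fails, otherwise None."""
    if not changed_files:
        return "no-change"
    for path, source in changed_files.items():
        if not in_scope(path, allowed_write_paths):
            return "scope-violation"
        if path.endswith(".py"):
            try:
                ast.parse(source)  # syntax only; nothing is executed
            except SyntaxError:
                return "invalid"
    return None


def run_candidate(changed_files, allowed_write_paths, evaluate):
    """Call the expensive evaluator only for candidates that pass cheap checks."""
    outcome = cheap_checks(changed_files, allowed_write_paths)
    if outcome is not None:
        return {"outcome": outcome, "score": None}
    return {"outcome": "evaluated", "score": evaluate(changed_files)}


calls = []

def fake_benchmark(files):
    # Stand-in for a costly benchmark run; records that it was invoked.
    calls.append(sorted(files))
    return 0.5

allowed = ["tools", "AGENTS.md"]
results = [
    run_candidate({"tools/search.py": "def f(:\n"}, allowed, fake_benchmark),
    run_candidate({"secrets/key.txt": "x"}, allowed, fake_benchmark),
    run_candidate({"tools/search.py": "def f():\n    return 1\n"}, allowed, fake_benchmark),
]
for result in results:
    print(result)
```

Only the third candidate reaches the evaluator, which is the whole point: benchmark cycles are spent on candidates that have already passed syntax and scope checks.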

Why We Are Hosting A Harness Engineering Event

This is exactly why Agent Engineering HQ is hosting Harness Engineering: State of the Art in Agent Harnesses at AWS Builder Loft in San Francisco. The event is focused on the technical reality behind modern coding-agent harnesses: orchestration, memory, tool design, verification, error recovery, execution runtimes, trace analysis, and self-optimizing harnesses. We will be discussing the recent wave of harness research, including Meta-Harness, Natural-Language Agent Harnesses, HALO, and now Agentic Harness Engineering. More importantly, we want to bring together the engineers and researchers actively building these systems in practice. If you care about reliable coding agents, agent evaluation, memory and harness alignment, self-healing agent systems, or production-grade agent infrastructure, this is the room to be in.

RSVP here: https://luma.com/rtd0f6ka

Space is limited. RSVP early.

Final Thought

Harness Engineering made the model-plus-harness equation visible. Meta-Harness showed that the harness itself can be optimized as executable code. HALO shows that trace analysis can turn messy agent runs into concrete harness improvement reports. Agentic Harness Engineering now shows what the next phase requires: observability that lets agents improve the harness without collapsing into blind trial and error. The future of agent performance will not come only from larger models. It will come from better harnesses, better evidence, better feedback loops, and better engineering discipline around the systems that let models act. That is the next frontier of Agent Engineering.

Launching the London Agentic AI Community Website

Shashi Jagtap — Wed, 29 Apr 2026 08:46:38 GMT

Superagentic AI is excited to announce the launch of the new official website for London Agentic AI, a community created for engineers, researchers, founders, and practitioners building real-world AI agents in London. Until now, the London Agentic AI community of almost 4.5k builders has been scattered across Meetup and Luma. It now has a new home. The new website is live at

londonagenticai.com

London Agentic AI started in April 2025 as a builder-led community focused on production-grade agentic AI. Over the past year, the community has grown across Meetup and Luma, bringing together thousands of people interested in agents, coding agents, MCP, ACP, evals, RAG, observability, workflows, and production systems.

As the community grew, one challenge became clear. Event information was spread across multiple platforms, including Meetup, Luma, YouTube, and social media. People often asked where they could find upcoming sessions, past events, speaker information, videos, and sponsor details in one place.

The new London Agentic AI website solves that problem by creating a single home for the community.

What the Website Includes

The website brings together the key parts of the London Agentic AI ecosystem. Visitors can now explore upcoming events, past sessions, speaker details, videos, sponsor information, and the wider Agentic AI network that the community is building around.

It also gives sponsors, speakers, and attendees a clearer view of what London Agentic AI represents. The site highlights the community’s focus on high-quality technical events, real-world systems, and the growing London ecosystem around agentic AI.

A Home for London Agentic AI Events

London Agentic AI has hosted events with teams and organisations including Google DeepMind, Databricks, JetBrains, Tessl, Zed, and AutogenAI. These sessions have covered topics such as agentic coding, evaluation frameworks, Model Context Protocol, Agent Client Protocol, agentic RAG, production agents, and enterprise AI systems.

With the new website, these events can now be presented in a more structured and discoverable way. Instead of searching across different platforms, attendees can visit one place to understand what has happened, what is coming next, and how to get involved.

Website Demo

We have also published a short demo video showing the new London Agentic AI website.

Next Events

The next phase of London Agentic AI will continue to focus on high-signal technical sessions for people building real systems. Upcoming events will cover enterprise agents, coding agents, evaluation, observability, workflow orchestration, and production architectures.

You can explore upcoming events and register through the official website:

Visit London Agentic AI.

London Is Becoming an AI Hub

London is becoming an important hub for agentic AI. Research labs, startups, enterprise teams, developer-tool companies, and universities are all contributing to the growth of this space.

London Agentic AI exists to connect that ecosystem through technical events, community discussions, speaker sessions, and real-world examples of agentic AI in practice.

For Superagentic AI, supporting this website is part of a broader mission to help build the infrastructure, knowledge, and community around agentic systems.

Explore the Website

The new website is now live and will continue to evolve as the community grows.

Visit the official London Agentic AI website here:

https://londonagenticai.com

For speaking, sponsorship, hosting, or collaboration opportunities, please visit the website or contact the London Agentic AI team.

Join the Community on Luma and Meetup

Luma: https://luma.com/londonagenticai

Meetup: https://www.meetup.com/london-agentic-ai

Stay Up to Date on Our Website:

https://londonagenticai.com

Year One of Superagentic AI: From Apple to Agentic AI Engineering

Shashi Jagtap — Tue, 28 Apr 2026 14:36:36 GMT

Today, April 28 2026, marks the first anniversary of Superagentic AI. One year ago, on April 28 2025, Superagentic AI was officially incorporated. A few days before that, I had completed my final day at Apple (24 April) after nearly six years. What began as a deeply personal transition from a stable and meaningful chapter at Apple has become a full stack Agentic AI company operating across the United Kingdom and the United States.

This post is a reflection on that first year. It is also a record of what can be built by a solo founder with conviction, technical focus, open source discipline, and a clear belief that the future of software will be shaped by agents, optimization, memory, protocols, evaluation, and real engineering systems.

Superagentic AI did not start as a polished corporate idea, and it was not built on a popular OSS library with a massive user base. It began as a founder-led experiment to prepare myself for the next decade of technical revolutions. It began with years of developer experience and a belief that Agentic AI needed more than demos. It needed engineering discipline and solid, future-proof technologies built on the five core pillars of Superagentic AI: Agent Engineering, Agent Experience, Agentic DevOps, Quantum AI, and Agentic Co-Intelligence.

One year later, Superagentic AI has grown into a company with products, open source projects, research initiatives, community programs, technical writing, podcasts, events, and a growing movement around Agent Engineering. It also helps clients build tools and frameworks to support their journey.

The Origin: From Apple to Superagentic AI

The story started with a decision to leave Apple and begin a new chapter. After nearly six years at Apple, I wrote the founding post, Life 3.0: Goodbye Apple. Welcome Superagentic AI. That post captured the emotional and professional transition from working on Developer Experience at Apple to starting a company focused on the next generation of intelligent software systems.

Apple shaped the way I think about systems, quality, developer workflows, experience, and long term engineering discipline. That background became central to Superagentic AI. The idea was not how humans experience software, nor how developers experience software. It was how agents experience software. The blog post on Agent Experience from May 2025 suggests that Superagentic AI was thinking about this well before the industry began to recognise it. It was hugely inspired by Netlify’s Agent Experience ideas. That was the beginning of Agent Experience.

At Apple, I worked on the Developer Experience team, building tools and frameworks for internal Apple developers working on the Xcode, iOS, macOS, tvOS, and watchOS SDKs. Developer Experience has been one of the most important disciplines in modern software. It helps humans build, test, ship, and operate software effectively. But agents introduce a new requirement. Agents do not use software in the same way humans do. They do not browse, scroll, or click in the same way. They operate through structure, intent, memory, tools, context, constraints, and feedback loops.

Superagentic AI was created around the belief that future systems must be designed not only for developers, but also for agents. Agents need environments they can understand. They need protocols they can follow. They need memory they can retrieve from. They need tools they can use safely. They need evaluations that can measure their behavior. They need optimization loops that can improve their performance.

This was the shift from Developer Experience to Agent Experience.

Year 2025: Four Months at Apple, Eight Months Building Superagentic AI

2025 was not a normal year. It was a transition year. The first four months were spent at Apple, reflecting deeply on the future of software, the rise of AI agents, and the kind of work worth dedicating the next decade to. Leaving Apple was not an impulsive decision. It involved reflection, family discussions, financial planning, and honest self questioning. On April 24 2025, I completed my final day at Apple. On April 28 2025, Superagentic AI officially began. From that moment, the rest of the year became an intense experiment in building, learning, failing, restarting, writing, shipping, and trusting conviction. It was the beginning of a company of one, but it was never a small ambition. The mission was to build the infrastructure, tools, research, and community needed for the Agentic AI era.

The Thesis: Agent Engineering and Agent Experience are the Missing Discipline

Superagentic AI was built around one central thesis: the agentic era needs engineering discipline. The work started with DSPy’s ideas and concepts, moved into building agent frameworks, and evolved later as the technology reshaped AI. Large language models are powerful, but they are not enough by themselves. Prompting is useful, but prompting alone is not enough. Context is important, but context alone is not enough. A framework can help, but a framework alone is not enough.

Production grade agents require a complete engineering stack. They need prompt engineering, context engineering, harness engineering, eval engineering, memory engineering, skills engineering, guardrail engineering, inference engineering, orchestration, observability, and optimization. This is what Superagentic AI calls Agent Engineering.

Agent Engineering is the discipline of designing, building, evaluating, optimizing, and operating reliable agentic systems. It is about moving agents from prototypes to production. It is about creating systems that can be tested, observed, improved, and trusted.

From day one, Superagentic AI focused on making Agentic AI production-worthy. The company was not created to chase short term hype. It was created to build for the deeper infrastructure layer of the agentic future. One important thing: it was not created to impress investors.

The Five Pillars of Superagentic AI

The company was shaped around five pillars: Agent Engineering, Agent Experience, Agentic DevOps, Agentic Co-Intelligence, and Quantum AI. These pillars were not abstract branding. They became a practical product and research map.

Agent Engineering became the foundation for SuperOptiX, the full stack framework and optimization platform for building, evaluating, and orchestrating agents.
Agent Experience became the philosophy behind SpecMem, AgentVectorDB, CodeOptiX, and the broader idea that systems must be understandable and operable by agents themselves.
Agentic DevOps became the direction behind tools that help agents participate in software delivery, coding workflows, quality engineering, automation, and optimization.
Agentic Co-Intelligence became the long term direction for collaboration between humans, agents, and multi-agent systems.
Quantum AI became the research frontier that led to SuperQuantX, an open source step toward unifying Quantum AI development and exploring the intersection of agentic systems and quantum computing.

This structure gave Superagentic AI coherence. Every product, project, blog post, event, and research idea had to connect back to a larger thesis.

The First 90 Days: From Vision to Velocity

The first three months set the tone for everything that followed. By July 2025, Superagentic AI had already published its first major milestone update, 3 Months of Superagentic AI: From Vision to Velocity. The post captured the early execution of the company and showed how quickly the foundation had been built.

During those first 90 days, Superagentic AI launched SuperOptiX, released early open source projects, published technical blog posts, launched podcast episodes, created business focused agentic solutions, and started building community around London Agentic AI.

SuperOptiX was introduced as an evaluation first, optimization core, orchestration ready Agentic AI framework designed for production grade agents. It included early ideas around declarative agent design, DSPy based optimization, memory systems, orchestration, behavioral evaluation, and tiered agent architectures.

AgentSpy was introduced as a protocol first AI agent framework built around DSPy and agentic protocols. AgentVectorDB was introduced as a cognitive memory core for AI agents, designed around retrieval optimized and context rich storage.

The company also introduced early business focused solutions. These represented the first attempt to translate the technical thesis into practical offerings for organizations exploring agentic systems. These solutions helped us win early clients to work with and learn from. The first three months were important because they established the company’s operating model. Build quickly. Publish publicly. Open source early. Learn from builders. Improve continuously.

The Build Timeline: From Setup to Systems

The first year did not happen as one big launch. It unfolded month by month.

May was about foundations. Superagentic AI was registered, operations were set up, legal and accounting work began, domains and trademarks were organized, and the first websites were created. At the same time, the first open source projects were released, including AgentSpy and AgentVectorDB.
June was about turning the vision into architecture. SuperOptiX began to take shape as a DSPy powered agent framework. Early work focused on agent bricks, orchestration, optimization, and the overall structure of the platform. This was also the month when the ideas connected with the wider community through the LangChain London talk.
July became the first major product milestone. The first version of SuperOptiX was released. GEPA became part of the optimization direction. The three month review documented the early progress across framework development, blog writing, podcasting, open source, and community building.
August and September moved deeper into optimization, memory systems, model management, SuperSpec, RAG, GEPA, ODSC preparation, UKAI engagement, London Agentic AI, and SuperQuantX. The company was no longer only forming. It was becoming a full technical stack.
October was full of building tools and preparing to exhibit at the ODSC AI Conference in San Francisco.

By the end of 2025, Superagentic AI had moved from idea to ecosystem. The work had expanded across products, research, open source, community, writing, events, and global engagement.

The Hard Problem: Agent Optimization

One of the most important lessons of the first year was strategic focus. In the beginning, it was tempting to think the world needed another agent framework. But the market was already filling with frameworks. There were many ways to build agents. The harder question was not only how to build agents. The harder question was how to make agents better. That realization shaped the direction of Superagentic AI. The deeper problem was agent optimization.

Models will continue to improve. Frameworks will continue to evolve. But teams still need ways to optimize prompts, tools, retrieval, memory, traces, workflows, execution paths, and agent behavior. They need ways to measure what works. They need ways to improve systems without constantly rebuilding them from scratch. This became the connective tissue across Superagentic AI. It shaped SuperOptiX, CodeOptiX, SuperOpt, SpecMem, AgentVectorDB, and the broader research direction. The focus moved from simply building agents to optimizing agents.

SuperOptiX: The Flagship Platform

SuperOptiX became the flagship product of Superagentic AI.

Available at superoptix.ai, SuperOptiX is presented as a full stack Agentic AI optimization platform built around evaluation first development, optimization core architecture, and multi-agent orchestration.

The core idea was that developers should be able to define agents once, use a declarative specification, and generate native pipelines for different agent frameworks. SuperOptiX is designed to support builders who want flexibility, ownership, observability, and optimization without being locked into a single framework.
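To make the "define once, generate for many frameworks" idea concrete, here is a minimal Python sketch. It is not the SuperOptiX API or the SuperSpec schema: `AgentSpec`, `register`, `generate`, and the targets `framework_a` and `framework_b` are hypothetical names invented for illustration, and the emitted text is placeholder output rather than real framework code.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass(frozen=True)
class AgentSpec:
    """Hypothetical framework-neutral agent definition (not the SuperSpec schema)."""
    name: str
    instructions: str
    tools: List[str] = field(default_factory=list)


# Registry mapping a target name to a function that renders the spec for it.
GENERATORS: Dict[str, Callable[[AgentSpec], str]] = {}


def register(target: str):
    """Decorator that registers a renderer under a target name."""
    def wrap(fn: Callable[[AgentSpec], str]) -> Callable[[AgentSpec], str]:
        GENERATORS[target] = fn
        return fn
    return wrap


@register("framework_a")
def _render_a(spec: AgentSpec) -> str:
    # Placeholder output shaped like Python source for an imaginary framework.
    tools = ", ".join(spec.tools) or "none"
    return (
        "# framework_a pipeline (illustrative only)\n"
        f"agent_name = {spec.name!r}\n"
        f"agent_instructions = {spec.instructions!r}\n"
        f"agent_tools = '{tools}'\n"
    )


@register("framework_b")
def _render_b(spec: AgentSpec) -> str:
    # Placeholder output shaped like a config file for another imaginary framework.
    lines = [f"pipeline: {spec.name}", f"  prompt: {spec.instructions}"]
    lines += [f"  tool: {tool}" for tool in spec.tools]
    return "\n".join(lines) + "\n"


def generate(spec: AgentSpec, target: str) -> str:
    """Render one spec for the requested target, or fail loudly if unknown."""
    try:
        renderer = GENERATORS[target]
    except KeyError:
        raise ValueError(f"unknown target: {target}") from None
    return renderer(spec)
```

The point of the design is that the agent definition lives in one place, and each framework only contributes a renderer. Adding a new target means registering one more function, not rewriting the agent.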

SuperOptiX supports generation flows across major agent frameworks including DSPy, OpenAI Agents SDK, Claude Agent SDK, CrewAI, Google ADK, Pydantic AI, DeepAgents, and Microsoft Agent Framework.

The platform is built around SuperSpec, a YAML based declarative specification language for agents. It also connects to the wider optimization vision through GEPA, RAG, memory, context, and framework aware compilation.

SuperOptiX is not only a framework. It is a statement about how agent systems should be built. They should be evaluated from the beginning. They should be optimized continuously. They should be observable. They should work across frameworks. They should be designed for real production use.

Open Source as an Operating Principle

Open source has been central to Superagentic AI from the beginning. The Superagentic AI GitHub organization at github.com/superagenticAI now shows a growing collection of public repositories focused on Agentic AI tools, frameworks, optimization, memory, Quantum AI, coding agents, and developer infrastructure. By the first anniversary, the GitHub organization showed 32 public repositories.

The portfolio includes projects such as SuperOptiX, dspy-code, SpecMem, AgentVectorDB, SuperOpt, CodeOptiX, SuperQuantX, Meta-Harness, TurboAgents, CodexOpt, Agentnetes, SuperQode, and Agent Engineering 101. These projects are connected by the same underlying thesis. Agents need memory. Agents need optimization. Agents need protocols. Agents need harnesses. Agents need observability. Agents need evaluation. Agents need tools that make production systems more reliable.

The open source work was not a side activity. It was the way Superagentic AI explored the future in public. Each repository represented a question, an experiment, a building block, or a product seed for the agentic era.

This is part of the solo founder spirit behind Superagentic AI. A solo founder cannot outnumber larger teams, but can outlearn, outship, outfocus, and build in public with speed and clarity.

Agentic Coding as the Operating System

Superagentic AI was building tools for agents, but it was also built with agents. From early websites to frameworks, command line tools, integrations, documentation, research prototypes, and community assets, agentic coding became part of the company’s operating system. Working hands on with modern coding agents, AI powered IDEs, command line tools, and automation workflows revealed the strengths and weaknesses of the current agentic coding ecosystem. It also shaped the product direction directly.

This practical experience informed SuperOptiX, CodeOptiX, SpecMem, SuperQode, Meta-Harness, and the broader focus on harness engineering, evaluation, memory, and optimization. Superagentic AI was using agentic coding every day to build the company itself.

Research: SuperOpt and Agentic Environment Optimization

A major milestone in the first year was the release of SuperOpt. SuperOpt represents the research direction of Superagentic AI. The project focuses on Agentic Environment Optimization, which treats the agent environment as the object of optimization. Instead of focusing only on model weights, SuperOpt asks how prompts, tools, retrieval, memory, and traces can be optimized together as one system. This is an important shift. Much of the AI industry remains model centric. SuperOpt is environment centric. The SuperOpt announcement, Introducing SuperOpt: Research on Agentic Environment Optimization for Autonomous AI Agents, explains this direction in more detail. The research direction connects directly to the broader Superagentic AI philosophy. Agents are not just calls to a model. Agents are systems. They contain tools, prompts, context, memory, traces, policies, workflows, and feedback loops. If the environment around the model is weak, the agent remains unreliable. If the environment is optimized, the same model can often perform better. SuperOpt is therefore more than a research project. It is a statement about where the next layer of progress in AI systems may come from.
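As a rough illustration of what "environment centric" means, the toy sketch below holds the model fixed and searches over the environment instead: the prompt template, the retrieval depth, and whether a tool is available. This is not the SuperOpt algorithm. The echo-style `run_agent`, the tiny `DOCS` and `TASKS` data, and the exhaustive grid search are all assumptions chosen to keep the example small and deterministic.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Tuple


@dataclass(frozen=True)
class Environment:
    prompt: str           # template with optional {context} and {question} slots
    retrieval_k: int      # how many documents to retrieve
    use_calculator: bool  # whether the arithmetic tool is available


DOCS = [
    "paris is the capital of france",
    "berlin is the capital of germany",
    "the seine flows through paris",
]

# (question, substring expected somewhere in the agent output)
TASKS = [("capital of germany", "berlin"), ("2+3", "5")]


def retrieve(question: str, k: int) -> List[str]:
    """Rank documents by word overlap with the question and keep the top k."""
    words = set(question.split())
    ranked = sorted(DOCS, key=lambda d: len(words & set(d.split())), reverse=True)
    return ranked[:k]


def run_agent(env: Environment, question: str) -> str:
    """Toy stand-in for a fixed model: it can only surface what the environment gives it."""
    context = " ".join(retrieve(question, env.retrieval_k))
    output = env.prompt.format(context=context, question=question)
    if env.use_calculator and "+" in question:
        left, right = question.split("+")
        output += f"\nTool result: {int(left) + int(right)}"
    return output.lower()


def score(env: Environment) -> float:
    """Fraction of tasks whose expected answer appears in the output."""
    hits = sum(expected in run_agent(env, q) for q, expected in TASKS)
    return hits / len(TASKS)


def optimize() -> Tuple[Environment, float]:
    """Exhaustively search the small environment grid; the model never changes."""
    prompts = ["Question: {question}", "Context: {context}\nQuestion: {question}"]
    candidates = [
        Environment(p, k, c)
        for p, k, c in product(prompts, (0, 1, 2), (False, True))
    ]
    best = max(candidates, key=score)
    return best, score(best)
```

In this toy setup the same "model" goes from answering nothing to answering everything purely because the prompt exposes retrieved context and the calculator tool is enabled. A real system would replace the grid search with something far more sample efficient and would score against real traces.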

Quantum AI: SuperQuantX and the Frontier Beyond Classical Agents

The first year also included the launch of SuperQuantX. SuperQuantX was introduced as an open source SDK designed to unify Quantum AI development and provide a practical foundation for exploring the intersection of agentic systems and quantum computing. The launch post, Introducing SuperQuantX, explains the motivation behind building a unified SDK for Quantum AI development. This work sits under the Quantum AI pillar of Superagentic AI. It is early, exploratory, and research oriented. But it matters because it reflects the long term direction of the company. The immediate focus is Agent Engineering, optimization, memory, observability, protocols, and production systems. The longer term frontier includes the intersection of agents and quantum computing. SuperQuantX gives that frontier a practical starting point.

Content as a Product

Superagentic AI was built in public through writing. The founder blog at shashikantjagtap.net became active again after years of silence. The company blog at super-agentic.ai/resources/super-posts became the home for Superagentic AI updates, product announcements, research notes, and technical articles. Within the first three months, Superagentic AI had already published sixteen blog posts and launched podcast episodes covering DSPy, multi-agent orchestration, memory systems, and related topics. By the end of 2025, the year in review post, Year in Review 2025: From Apple to Superagentic AI, Building an Agentic Company of One, reflected on eight months of building, learning, experimenting, and committing to a direction that could matter for years. The writing covered Agent Engineering, Agent Experience, Context Engineering, memory, observability, optimization, GEPA, SuperSpec, SuperOptiX, local models, agent protocols, coding agents, SuperOpt, SuperQuantX, and more. This content became public research notes, product education, technical documentation, community building, and a record of how the company’s thinking evolved over time.

London Agentic AI: From Community Idea to Builder Movement

One of the most meaningful achievements of year one was the creation and growth of London Agentic AI. London Agentic AI was formed on April 25 2025, just before Superagentic AI was incorporated. The community was created as a highly technical, vendor neutral space for AI engineers, agent builders, researchers, founders, and technical leaders working on production grade agent systems. The community website at londonagenticai.com describes London Agentic AI as a serious builder community focused on Agentic AI Engineering end to end, including prompts, context, MCP, tools, coding agents, evals, memory, harnesses, guardrails, inference, observability, and safe adoption. By the first anniversary, London Agentic AI reported a reach of more than 4,500 AI builders and curated technical rooms of 100 to 150 people.

The community also introduced the Agent Lines curriculum, a London inspired map of the Agentic AI Engineering stack. The lines include Prompt Line, Context Line, Harness Line, Eval Line, Memory Line, Skills and Tools Line, Guardrails Line, Inference Line, Protocol Line, AgentOps Line, Coding Agents Line, Orchestration Line, Agent Experience Line, Safe Adoption Line, Observability Line, and Optimization Line.

London Agentic AI matters because it turns the company’s thesis into a shared conversation. Agent Engineering is not only a product category. It is an emerging practice. The best way to define that practice is by bringing builders together and creating space for serious implementation discussions.

Events, Conferences, and Global Engagement

Superagentic AI’s first year was not limited to writing and code. It also included in person events, speaking engagements, community building, and international exposure. The early months included a LangGraph talk at a LangChain hosted London event, participation in London Tech Week, the launch of London Agentic AI, and early conversations with investors, founders, and technical communities. In 2025, Superagentic AI also participated in ODSC AI San Francisco, bringing its agent optimization work to a global AI audience. The company also joined UKAI and participated in discussions around the future of AI in the United Kingdom, including engagement connected to policy, standards, infrastructure, and the Agentic AI era. These activities were important because Agent Engineering cannot be built only through code. It needs shared language, community dialogue, technical education, and trust across builders, researchers, founders, enterprises, and policy groups. Superagentic AI started from London, but the first year made the company increasingly international. The work now connects London, San Francisco, the United States, the United Kingdom, open source communities, technical meetups, and global AI practitioners.

Expanding to the United States

In early 2026, Superagentic AI expanded its operational presence to the United States. The announcement, Superagentic AI Expands to the United States of America, described the next phase of the company’s growth across London and the United States. This expansion matters because the agentic AI ecosystem is global. London is becoming an important hub for AI builders, researchers, and applied AI companies. San Francisco remains one of the most important centers for AI infrastructure, startups, research, and developer ecosystems. Superagentic AI is now positioned across both worlds. The UK and USA presence reflects the company’s ambition to contribute to Agent Engineering as a global discipline, not only as a local startup story.

Agent Engineering Conference: Turning the Thesis Into a Category

The next major step is Agent Engineering Conference. The conference website at agentengineering.world presents Agent Engineering Conference as a dedicated conference for the engineering disciplines behind Agentic AI coding. Agent Engineering Conference is designed to bring together the builders working on prompt engineering, context engineering, harness engineering, eval engineering, memory engineering, skills engineering, guardrail engineering, inference engineering, orchestration, agent optimization, agent experience, coding agents, and production grade agent systems. This conference is a natural extension of the first year of Superagentic AI. The company started by building tools. It then published research, wrote publicly, created open source projects, launched London Agentic AI, spoke at events, and helped shape the conversation around Agent Engineering. Agent Engineering Conference is the next step in turning that conversation into a recognized discipline. Year one proved the foundation. Year two is about helping the category mature.

Building Through Constraints: Solo Founder Journey

The first year was not only about launches and milestones. It also included constraints. Building as a solo founder means carrying the full weight of product, engineering, research, writing, community, marketing, operations, finance, legal, partnerships, and strategy. There is no large team to hide behind. There is no department to pass work to. Every decision, every launch, every event, every line of writing, and every product direction requires focus. There were also personal constraints. As shared in the 2025 year in review, an injury in August forced me to work from bed for several weeks. That period could have slowed everything down completely. Instead, it became a time for deep optimization work, writing, experimentation, and preparation for the next phase. That experience reinforced one of the most important lessons of the year. Building a company of one requires resilience. It requires patience. It requires the ability to keep moving even when conditions are not ideal. The solo founder journey is not romantic every day. It is intense, uncertain, and demanding. But it also creates unusual clarity. There is no room for politics. There is no room for bureaucracy. There is only the work, the mission, the users, the community, and the next meaningful thing to ship.

What Did Not Work

The first year also made one thing clear: building is only half the battle. Shipping code is hard, but explaining why the work matters can be harder. Designing an API is hard, but distribution is harder. Building a product is hard, but earning trust is harder. Writing code can be easier than building community. Creativity can be easier than focus.

As a technical founder, it is natural to keep building. But the market needs more than products. It needs clarity. It needs repetition. It needs education. It needs examples. It needs proof that the category matters. This was one of the most important lessons of year one. Superagentic AI did not only need to build tools. It needed to explain Agent Engineering, educate builders, support communities, and show why optimization, memory, evals, observability, and harnesses are central to the future of agents.

The Solo Founder Spirit: One Person, $x Dollar Company

The first year of Superagentic AI is also a story about what a solo founder can build in the agentic era. A solo founder can register a company, build products, launch open source projects, publish technical research, create a community, host events, speak at conferences, write long form technical content, launch podcasts, build websites, explore global partnerships, and define a category. That does not mean it is easy. It means the tools have changed. AI assisted development, agentic coding, open source, public writing, automation, and community led growth allow one focused builder to move with a level of velocity that would have been much harder a few years ago. Superagentic AI is not proof that teams do not matter. Teams matter deeply. But year one is proof that a solo founder with a clear mission can build the foundation for something meaningful before a large team exists. The solo founder spirit is not about doing everything alone forever. It is about starting with ownership, speed, clarity, and conviction. It is about proving the thesis before asking others to believe in it. It is about turning personal risk into public momentum. That spirit defined the first year of Superagentic AI.

What Was Achieved in Year One

In one year, Superagentic AI moved across several connected dimensions. The product layer grew around SuperOptiX, CodeOptiX, SuperQode, SuperRadar, SpecMem, and related tools. The open source layer expanded into a public GitHub ecosystem with repositories covering agent frameworks, memory, optimization, Quantum AI, harnesses, orchestration, quality engineering, and developer tooling. The research layer introduced SuperOpt for Agentic Environment Optimization and SuperQuantX for Quantum AI exploration. The content layer produced technical writing, product announcements, founder reflections, and educational material across the founder blog and company blog. The community layer grew through London Agentic AI, which became a high signal builder community for Agentic AI Engineering. The events layer expanded through London meetups, technical talks, ODSC AI San Francisco, UKAI engagement, and the creation of Agent Engineering Conference. The geographic layer expanded from the United Kingdom into the United States. Together, these achievements form the real story of year one. Superagentic AI did not only build a product. It built a thesis, a product ecosystem, an open source portfolio, a research direction, a community, and a category narrative around Agent Engineering.

What Year One Taught Me

Year one taught me that agents need engineering discipline. The industry has moved beyond simple demos. Production agent systems need evals, memory, tools, context, protocols, observability, guardrails, and optimization. It taught me that optimization is the hard problem. Choosing a model matters, but optimizing the environment around the model may be just as important. Prompts, memory, tools, retrieval, traces, workflows, and execution paths all shape agent behavior. This is where Harness Engineering comes in, and we are already preparing an event in San Francisco dedicated to it. It also taught me that Agent Experience matters. Agents need systems they can understand, operate, and improve. This is not only a user experience problem. It is an infrastructure problem. In a fast moving field, public repositories, examples, documentation, and transparent experiments help builders understand what is real. Year one taught me that community creates categories. London Agentic AI proved that builders want serious, technical conversations about agents. They want to discuss evals, memory, coding agents, protocols, RAG, observability, safe adoption, and production systems. Year one taught me that a company of one can move fast, but focus is everything. Ideas are easy to generate. The hard part is choosing the right ones, finishing them, explaining them, and connecting them to a larger mission.

From Build Mode to Market Mode

The first year was intentionally build heavy. The priority was to create foundations: products, research, open source projects, content, events, and community. That foundation now exists. The next phase is different. Year two is about turning that foundation into adoption. That means deeper product development, more real world use cases, stronger partnerships, forward deployed engineering opportunities, pilot programs, continued open source releases, and a wider presence across London and San Francisco. Year one proved the foundation. Year two is about applying it with builders, teams, enterprises, researchers, and communities working on real agentic systems.

Looking Ahead: Year Two

The mission of Superagentic AI remains the same: build the infrastructure, tools, research, and community needed for the agentic era. SuperOptiX will continue to evolve as a full stack Agentic AI optimization platform, with more focus on Agent Engineering practices, including harness engineering, memory engineering, and coding agents. The open source ecosystem will continue to expand across memory, harness engineering, agent optimization, quality engineering, orchestration, and developer tools. SuperOpt and SuperQuantX will continue to represent the research frontier across Agentic Environment Optimization and Quantum AI. London Agentic AI will continue to serve builders in the United Kingdom. Agent Engineering Conference will bring the discipline together in London and San Francisco. It is taking a long time to secure sponsors, but we will get there.

The United States expansion will help Superagentic AI connect more deeply with the global AI builder ecosystem. We will be working with more US based clients this year. The long term ambition is clear. Superagentic AI wants to help make Agent Engineering a real discipline, not just a phrase. It wants to help developers, researchers, founders, and enterprises build agent systems that are reliable, observable, optimized, and ready for production.

Thank You, All.

Thank you to everyone who has followed, supported, attended, spoken, sponsored, contributed, read, listened, starred a repository, shared feedback, or believed in the mission. Thank you to the London Agentic AI community for showing that serious builders want serious technical conversations. Thank you to the open source community for exploring, testing, and sharing ideas. Thank you to every founder, researcher, engineer, investor, speaker, sponsor, and friend who helped shape this first year.

Thank you to all the companies I have had the chance to interact with so far, and especially to StackOne, which gave us an opportunity to build with them. Superagentic AI began as a leap from Apple into the unknown.

One year later, it has become a full stack Agentic AI company, an open source ecosystem, a research lab, a community builder, and a growing voice for Agent Engineering. The agentic era is no longer a future prediction. It is being built now.

Year one was the foundation. Year two is where Agent Engineering moves from idea to discipline.

Key Resources

Company:

https://super-agentic.ai

SuperOptiX:

https://superoptix.ai

London Agentic AI:

Agent Engineering Conference:

Open Source: https://github.com/superagenticAI

Founder Blog:

https://shashikantjagtap.net

Founding Post: Life 3.0: Goodbye Apple. Welcome Superagentic AI

Three Month Update: 3 Months of Superagentic AI: From Vision to Velocity

2025 Review: Year in Review 2025: From Apple to Superagentic AI, Building an Agentic Company of One

SuperOpt Research: Introducing SuperOpt: Research on Agentic Environment Optimization for Autonomous AI Agents

SuperQuantX: Introducing SuperQuantX

Building London Agentic AI. Solo: One Year of a High-Signal AI Community

Shashi Jagtap — Mon, 27 Apr 2026 22:41:51 GMT

Building tech communities has been my passion for a long time. I have been building communities in London since 2012, as my Meetup profile shows. I have built communities in the DevOps and Test Automation space, but building London Agentic AI was a completely different experience. In this post, I will share a few things about how London Agentic AI was built from the ground up, along with future plans.

Coming back to the community after six years

Although I had been part of the tech community for a long time, I joined Apple in July 2019. After spending six years at Apple, I stepped out on 24 April 2025 into a very different world. During those years, I had no real presence in the public community. No social media reach, no visible network in the London tech scene, especially in AI, and very little interaction outside of my immediate work environment. Before that, I had been organising meetups in London since around 2012. I had hosted events with different companies, built communities, and spent a lot of time bringing people together. But once I joined Apple, I moved into a completely different environment, and over time I lost touch with that side of my life.

The very next day after leaving Apple, on 25 April 2025, I registered London Agentic AI on the Meetup website. Coming back into the community space after six years felt unfamiliar and felt like a new beginning. It was not just about starting something new. It was about rebuilding from zero in a different field, with a different audience, and without the network I once had. I started by attending meetups across London again, introducing myself, listening more than speaking, and gradually reconnecting with people. I met individuals who were doing serious work in AI and began forming relationships slowly. Those early events and small interactions helped me understand where the space was heading and what was missing.

Why I started London Agentic AI

At that time, there were quite a few events happening in London, but none of them were specific to agentic AI. I realised that agents were the next big thing in AI. I wanted to start a meetup around this topic, and since I had named my company Superagentic AI, I thought London Agentic AI was the best name for it. Agents had just started to take shape and coding agents had just started to arrive, so everything was at an early stage, which made it the right time to kick off agentic AI events in London. I was following the technical conferences and events in San Francisco, but there was nothing similar for Agentic AI in London, so I spotted that opportunity and started the meetup. Nothing was concrete. Nothing was planned. I just wanted to put the idea out there and start something new in the agentic AI space.

The quiet beginning

On 25 April 2025, the day after leaving Apple, I created the London Agentic AI meetup. Two days later, on 27 April, I announced it. Around 40 people signed up on the first day. It was a small but meaningful start. After that, I did not immediately run events. For the next couple of months, I was focused on starting my company, building its foundation, and figuring out direction. During that time, people continued to join quietly. There was a steady, organic interest that showed there was something worth building. I was busy building Superagentic AI and hardly had any time to host events, but people had started noticing the Agentic AI boom and I felt there was a need to put something together. I found it pretty challenging to compete with established AI groups in London and to justify how London Agentic AI was different. It was almost the end of July and there was no point hosting an event in August, so the first event was postponed to September.

The first event: AutogenAI + DSPy and Friends

The first event was finally planned for September at AutogenAI. The main speaker was Mike Taylor, whom I had met to talk about DSPy. It was a fairly new topic for the London audience: accessible to AI/ML engineers but harder for software engineers. We also had Leo, who introduced new concepts to the London AI community. Mike and Leo set the perfect stage for this community, and I can’t thank them enough for promoting the meetup group in their networks and giving excellent talks at the first meetup. Huge thanks also to AutogenAI for providing the venue for the inaugural event. London Agentic AI was fortunate to have strong speakers who spoke about DSPy, real agents, and the semantic layer. That session marked a turning point. It set the tone for what the community could become and the kind of technical depth it could offer. I think it was the first time the London AI scene had learned something this different and unique at a tech event. And so the journey started.

Building Community events at Tessl and Databricks on Meetup

After the first meetup, momentum had already built. Tessl, a new startup with an excellent space for hosting tech events, sponsored our second event on MCP, and we again had amazing talks from Macey and Guillaume from StackOne. We had more than 150 people in the room, and the atmosphere was electric at the second meetup on 16 October. After that, I hosted amazing events on agent evaluation at Databricks with the support of Sultan, my connection at Databricks.

As I had been using the Meetup website since 2012, I thought it was the best way to host events. I was not aware that other tools like Luma had come onto the market during my six years at Apple, and that they were even more powerful. However, my old-fashioned brain hesitated to move the community to Luma. At the end of 2025, I was convinced to give Luma a try and host the next event with Google DeepMind on Luma, and things changed drastically.

Luma and Unprecedented event with Google DeepMind

The first event hosted on Luma was with Google DeepMind, and the results were shocking. We received 1952 registrations for an event with space for 120. I was new to Luma and wasn’t aware of all its features. People kept signing up and I kept watching. There were 200 sign-ups on Meetup as well. It became tough to choose 100 people from nearly 2,000 Luma registrations, plus some from Meetup. Unfortunately, I had to keep the event approval-only on Luma and tell Meetup members that spaces were limited. That was the lesson from the first event hosted on Luma: after that, I decided not to take registrations from Meetup, as it has no ticket or pass system and its RSVP process is too brittle. The Google DeepMind event was epic, and huge thanks to Ian, Amit, Ricardo, and the other panelists from Google DeepMind for making this event amazing for the community.

Bringing Agent Client Protocol to London with Jetbrains and Zed

After the amazing event with DeepMind, the London Agentic AI community introduced another protocol for coding agents: Agent Client Protocol (ACP). We covered the state of the art with the lead maintainers of ACP, Sergey and Ben. It was a fantastic event at the Tessl office. There were other events from Google and Anthropic on the same day, but we still managed to get more than 150 people in the room, with great questions from the audience. Sergey, Ben, and other speakers travelled to London for this event.

From zero to a growing High Signal community: Non Stop

From that point onward, the community started to grow with momentum. Each event brought together people who were deeply involved in building agent systems. Over time, London Agentic AI grew to more than 2,000 members on Meetup and over 2,500 members on Luma, with a combined reach of more than 4,500 people. We hosted events with teams and organisations such as Google DeepMind, Databricks, JetBrains, Tessl, Zed, and AutogenAI. The topics evolved naturally as well, covering MCP, ACP, coding agents, evaluation frameworks, agentic RAG, and production systems.

Scaling the community and moving to Luma

As the community grew, the challenge changed. It was no longer about finding people. It was about managing scale while keeping the quality of the room intact. Most venues in London can host around 80 to 100 people, and demand often exceeded that capacity. Meetup did not provide enough control over attendance, and it became difficult to manage RSVPs or ensure that the right mix of people could attend.

This led to the decision to gradually move events to Luma. It offered better control over registrations, approvals, and access. It made it possible to curate the room more intentionally. At the same time, it introduced a new challenge. Choosing who gets access to a limited number of seats is not easy. There is always more interest than space. I try to bring together a mix of engineers, researchers, founders, and practitioners so that every event feels valuable for both attendees and sponsors. Getting the gender balance right is a particular priority. It is an ongoing process and something I continue to refine.

Launching the website: londonagenticai.com

Alongside this, another problem became clear over time. People kept asking where they could find everything in one place. Events were split across Meetup and Luma, and it was not always easy to keep track. The search feature on both platforms is almost broken, and lots of other events with similar themes keep popping up in London. London Agentic AI had become a high-signal event that was hard to find. I realised it would be a great idea to build a website so that everything could live in one place.

To mark the first year, I am launching the official website, londonagenticai.com. The website brings together upcoming events, past sessions, speakers, videos, and an overview of the different areas within agentic AI that the community explores. It is designed to make it easier for people to discover the community and stay connected without searching across platforms.

The people behind it

The past year has been shaped by the people involved: speakers who shared real work, sponsors and hosts who supported the events, and attendees who showed up, asked thoughtful questions, and kept coming back. The feedback across LinkedIn and other platforms has been encouraging and has reinforced the value of keeping the community focused and technical. I’m so grateful that I could build this community, which has helped many people learn, build connections, and shape agentic AI in London. I truly respect all the speakers, all the sponsors, and all the people who helped me organise these events. I have built good relationships and friendships with most of the people I worked with during those events.

Building Solo

Over the years, I have learned that I work best building communities solo. Having co-organisers and co-hosts has not worked for me. It often slowed things down or created unnecessary friction. I made a conscious decision to build London Agentic AI on my own, and that decision has worked well. It allows me to move fast, stay consistent, and keep a clear direction for the community. Of course, this does not mean doing everything alone without any support. I have been fortunate to have friends, founders, and people in the community who have helped when needed, often without expecting anything in return. That kind of support has made a big difference. But at its core, I prefer to build as a solopreneur. I plan to continue running the community this way, and carry the same approach into future work, whether it is growing London Agentic AI further, building conferences, or continuing with Superagentic AI. I believe one person, with the right focus and intent, can build something meaningful.

The ecosystem is growing, What Next?

It is also interesting to see how the ecosystem is evolving. There are now other communities emerging with similar themes and even similar names. I see that as a positive sign. It means more people care about this space. There is room for multiple communities, and there is value in growing the ecosystem together in London.

Looking ahead, the focus is on continuing to build depth. More technical sessions, more real-world case studies, and stronger connections with teams working on agent systems across research labs and companies. London has the potential to become a strong centre for this space, and this community can contribute to that in a meaningful way.

Final thoughts

London Agentic AI and my company, Superagentic AI, started one day apart. One became a place where I learn in public alongside other builders. The other is where I focus on building. Both are still early, and both continue to evolve. One year in, the community feels established but still at the beginning of a longer journey. There is a sense of direction, a growing network of people, and a clear need for spaces like this. I will try my best to offer something meaningful in the coming year.

Join the network

If you are working in this space or interested in it, you can explore more on the website, join events through Luma, or connect through Meetup. If you would like to speak, sponsor, or collaborate, you can reach out at London Agentic AI

This is just the beginning. Stay Tuned with London Agentic AI

New OpenAI Agents SDK: The Dawn of Extreme Harness Engineering

Shashi Jagtap — Thu, 16 Apr 2026 01:35:35 GMT

OpenAI released a major evolution of its Agents SDK, a fundamental rethinking of how agents should operate in production. They introduced an open, inspectable harness for orchestration and a clean separation from the sandbox where real computation happens. For developers tired of building fragile orchestration layers themselves, this feels like the moment the industry finally matured.

At Superagentic AI, we’ve been obsessed with making agents reliable and optimizable. That’s why we integrated the new SDK into SuperOptiX v0.2.25 immediately. Here’s the story of what this launch truly brings, why it represents “extreme harness engineering,” how it fits alongside ideas like Recursive Language Models (RLM) and Cloudflare’s work, and what it means for anyone building serious agent systems.

Building agents that go beyond simple chat has always been harder than it should be. You start with a powerful model and good instructions, but the moment the task involves multiple steps (reading files, editing code, running commands, producing real outputs), things break. State gets lost. Credentials leak. Containers crash. You end up writing your own control loops, sandboxes, and durability systems.

Most internal agent frameworks at companies looked strangely similar. The community had been quietly converging on the need for better “harness engineering”, a robust layer that lets models focus on intelligence while the system handles reliability. OpenAI just productized that missing piece.

What the New SDK Actually Delivers

The breakthrough is the clean split between two layers: harness and sandbox. The harness acts as the intelligent conductor, open and inspectable. It manages the agent loop, tools, tracing, handoffs, approvals, and memory decisions. You now have fine-grained control over when and where memory lives. The sandbox is the safe workspace where the agent does real work. A new Manifest system makes data staging declarative and secure. Here’s a small taste of how clean it looks:

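# Imports are omitted in this excerpt: SandboxAgent, Manifest, LocalDir, and
# Runner come from the OpenAI Agents SDK. `optimized_prompt` and `task` are
# assumed to be defined earlier, and the `await` must run inside an async function.
agent = SandboxAgent(
    name="Research Analyst",
    model="gpt-5.4",
    instructions=optimized_prompt,   # from GEPA in SuperOptiX
    default_manifest=Manifest(entries={
        "data": LocalDir(src="./dataroom", permissions="ro"),    # read-only input
        "output": LocalDir(src="./results", permissions="rw")    # writable output
    })
)

result = await Runner.run(agent, task)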

You can mount local folders, git repos, or cloud storage (S3, GCS, Cloudflare R2), set Unix-style permissions, and the agent gets snapshotting and rehydration so it survives crashes and pauses. TemporalIO powers the harness for real durability.

On social media, the excitement was immediate. Developers called it “a long-running agent runtime with sandbox execution and direct control over memory and state.” Many highlighted the shift “from simply answering questions to delivering real, tangible work.” The partnership ecosystem (Cloudflare, Modal, E2B, Vercel) got strong praise too.

How It Compares to RLM

Some noticed an “RLM-ish” flavor because of the harness focus. The Recursive Language Models paper (late 2025) showed how to tackle massive context by letting models orchestrate computation externally through recursive calls in a REPL environment. There’s a shared philosophy: move beyond stuffing everything into one prompt.

However, OpenAI took a different path. RLM is an inference-time technique for extreme long-context scaling. The new SDK is a production runtime built for durability, security, and real computer use, with handoffs and parallel sandboxes instead of recursive self-calls.

Why SuperOptiX Embraced It So Quickly

SuperOptiX was designed for exactly this kind of evolution. In v0.2.25, the new SDK becomes first-class:

Define everything once in clean SuperSpec YAML (with optional sandbox settings).
The compiler automatically generates full SandboxAgent + Manifest pipelines.
GEPA optimization keeps refining instructions, tools, and memory — now running inside durable, permission-controlled sandboxes.
Long optimization runs get free snapshotting and effortless switching between local and cloud environments.

It feels like the final piece of the puzzle clicked. What used to require weeks of glue code is now simple configuration.

The Bigger Shift Ahead

This launch marks a broader move in the industry: models are becoming table stakes, while the harnesses, sandboxes, and durability layers that let them act reliably at scale are the new moat. For builders, the message is liberating. Stop reinventing the control plane. Focus on what truly matters — smarter optimization, better evaluations, and domain expertise.

We’re already working on the next SuperOptiX release with deeper Sandbox style integration and optional RLM-style extensions for extreme context needs. If flaky agents, lost state, or painful production deployments have held you back, this update changes the game.

Ready to experience it? Check out SuperOptiX v0.2.25 and the new OpenAI integration guide. The dawn of truly scalable agents is here.

Official Meta-Harness Repo + Packaged Power = Coding Agent metaharness

Shashi Jagtap — Thu, 16 Apr 2026 00:27:56 GMT

Stanford IRIS Lab officially released the reference code for Meta-Harness, their groundbreaking framework for autonomously optimizing the code scaffolding around a fixed large language model. The announcement quickly gained traction across social media, with builders praising the clean ONBOARDING.md workflow and the promise of applying the technique to entirely new domains.

Superagentic AI has been preparing for this day since we open-sourced our own implementation on April 2. Today we are excited to release metaharness v0.2.0, the full official-alignment update that turns the research reference into a polished, installable engine you can start using immediately.

This release is not a port. We kept everything that already made metaharness friendly to use (the CLI, filesystem run store with snapshots, write-scope enforcement, experiment matrix support, and strong Codex-first integration) while systematically adopting the strongest architectural ideas from the official Stanford release.

What Meta-Harness Means and Why It Matters

At its core, Meta-Harness flips the usual optimization target. Instead of fine-tuning or prompting the model itself, you treat the entire harness (the surrounding code that handles memory, retrieval, validation, tool routing, setup scripts, and evaluation logic) as the thing that evolves. A proposer agent iteratively rewrites harness files, evaluates changes on search splits, promotes strong candidates to held-out test sets, and builds a frontier of high-performing variants. The aim is better task-specific performance without touching the base model.

The official Stanford repo provides an excellent research reference and two worked examples, along with a conversational onboarding flow that makes it easy to bootstrap new domains. Our metaharness v0.2.0 complements that by delivering a fully packaged, provider-neutral runtime that teams can install and run today.

What’s New in v0.2.0

We completed every phase of the alignment plan we outlined internally. The result is a much more powerful yet still familiar library.

We added a clean DomainSpec system and a new metaharness onboard command that mirrors the official ONBOARDING.md experience. You can now define your evaluation unit, search and test splits, metrics, budget, and leakage protections through a guided conversation with your coding agent.

The architecture now revolves around a generalized DomainAdapter protocol. This makes it straightforward to plug in custom validation, search-stage evaluation, held-out test evaluation, and secondary metrics. Our existing coding-tool harnesses have been cleanly converted into the first adapter, so all previous workflows remain fully backward-compatible.

Evaluation now properly separates search and test stages to prevent leakage, exactly as the official examples recommend. We also introduced frontier-based selection policies that support single-objective maximization, lexicographic ordering, and Pareto optimization across accuracy, cost, context length, and latency. Batch proposal support lets the engine explore multiple candidates per iteration when you want it, while the simple single-candidate hill-climbing mode you already know is still there by default.
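To make the selection policies concrete, here is a minimal sketch of lexicographic and Pareto selection over evaluated candidates. It is an illustration only, not the metaharness implementation: the `Candidate` record, its fields, and the helper names are hypothetical, and only three of the metrics mentioned above are modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one evaluated harness variant."""
    name: str
    accuracy: float  # higher is better
    cost: float      # lower is better
    latency: float   # lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on every metric and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.cost <= b.cost and a.latency <= b.latency
    strictly_better = a.accuracy > b.accuracy or a.cost < b.cost or a.latency < b.latency
    return no_worse and strictly_better


def pareto_frontier(candidates: list[Candidate]) -> list[Candidate]:
    """Keep every candidate that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


def lexicographic_best(candidates: list[Candidate]) -> Candidate:
    """Maximize accuracy first, then prefer lower cost, then lower latency."""
    return max(candidates, key=lambda c: (c.accuracy, -c.cost, -c.latency))


# Made-up scores, purely for illustration.
candidates = [
    Candidate("baseline", accuracy=0.80, cost=1.00, latency=2.0),
    Candidate("cheap", accuracy=0.78, cost=0.40, latency=1.5),
    Candidate("accurate", accuracy=0.90, cost=1.60, latency=2.5),
    Candidate("wasteful", accuracy=0.79, cost=1.20, latency=2.2),
]

frontier = pareto_frontier(candidates)
best = lexicographic_best(candidates)
print([c.name for c in frontier], best.name)
```

In this toy data, "wasteful" drops off the frontier because "baseline" beats it on all three metrics, while the other three each represent a different trade-off. Lexicographic ordering collapses that frontier to a single winner, which is convenient when one metric clearly matters most.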

Telemetry received a major upgrade too. ProposalResult now captures detailed token usage, cost tracking, file read/write summaries, tool-call traces, and richer session metadata inspired by the structured logging in the official Claude wrappers but kept fully provider-neutral.

Finally, we included a lightweight reference domain modeled after the official text-classification example, so you can see the full pattern in action without heavy external dependencies. You can read the complete release notes here:

How the Two Repos Work Together

The official Stanford repository shines as a research foundation and domain bootstrapping toolkit. It excels at helping you define new problems and replicate the paper experiments. Our metaharness library complements it perfectly by providing the production runtime: a single installable package with a stable CLI, battle-tested Codex and local Ollama support, inspect and ledger commands, and filesystem persistence that survives long optimization runs.

Together they form a complete picture: research-grade concepts paired with shipping-grade tooling.

Get Started in Under a Minute

# Install via uv (recommended)
uv tool install superagentic-metaharness

# Try the built-in example
metaharness run examples/python_fixture_benchmark --backend fake --budget 5

# Or start a brand-new domain
metaharness onboard

What Comes Next

The age of manually hand-tuning harnesses is ending. The age of self-optimizing, inspectable, frontier-driven harnesses is here, and it is fully open source.

Drop your harness into metaharness, let the outer loop run, and watch it improve. We can’t wait to hear what results you get.

Full documentation is live at https://superagenticai.github.io/metaharness/
Repository: https://github.com/SuperagenticAI/metaharness

Open Memory and Open Harness Is Not Enough: You Need Self-Optimizing (Self-Healing) Harness

Shashi Jagtap — Mon, 13 Apr 2026 06:53:58 GMT

Recently there has been a lot of discussion about agent memory and harnesses. First, Anthropic published a blog post on scaling managed agents, which went viral. In reply, LangChain CEO Harrison Chase published the blog post “Your harness, your memory”, which started a debate on social media about whether agent memory should be decoupled from the harness and whether ownership matters. In this post, we will see what is missing from both sides of the debate. Both posts claim that agent memory and the agent harness should be decoupled and treated as separate entities. However, both miss that even if you put agent memory in a separate container, the memory needs optimisation too.

Harrison Chase nailed it with “If you don’t own your harness, you don’t own your memory.” Agent harnesses are the real product now. Markdown files are everywhere (AGENTS.md, SKILLS.md, CLAUDE.md), along with open memory layers and open tool schemas. The industry is finally moving toward ownership and away from vendor lock-in. I totally agree that openness is necessary. But it is not sufficient.

Even a perfectly open harness becomes brittle the moment you switch models, change providers, or let real-world usage evolve. The OpenClaw vs. Anthropic drama (April 4–6, 2026) proved this publicly and painfully. Self-optimizing, self-healing harness engineering is the missing layer.

The OpenClaw Drama: A Warning for Every Agent Builder

When Anthropic tightened restrictions, teams had to switch models. Hyper-optimized prompts, tool schemas, retrieval logic, and memory compaction, all tuned for Claude, crumbled. This wasn’t bad luck. It was the inevitable result of treating the environment as static. As we wrote in the previous post: the fragility came from deep optimization for one closed model. Vector DBs, work trees, and plain Markdown gave storage and version control but zero automatic adaptation. Even sophisticated RMLs (Retrieval Memory Layers) with typed episodic/semantic/procedural memory, confidence decay, and conflict resolution stay static unless the harness itself optimizes them.

Why Open Memory + Open Harness Still Breaks

Our latest Meta-Harness post hit the exact point: even with strong memory layers, you need an optimization memo that updates based on how this specific model behaves today.

Traditional tools fall short:

Vector DBs → great retrieval, blind to model drift in embeddings or reranking.
Work trees → versioned but not failure-aware or model-sensitive.
Markdown contexts → portable and readable, but explode without automated restructuring and compaction.

Every provider tweak forces manual prompt surgery. Not scalable.

The Core Idea: Treat the Entire Agent Environment as the Optimization Target

This is the heart of SuperOpt (December 2025 paper). We flip the paradigm. Keep the model frozen. Make the full environment (prompts, tools, RML configs, memory schemas, validation logic, filesystem instructions) the mutable optimization target. Failures are diagnosed and turned into Natural Language Gradients (NLGs): human-readable patches derived from execution traces.

A SuperController (Meta layer) routes issues to specialized engines:

SuperPrompt → evolutionary instruction optimization
SuperReflexion → self-healing tool schemas
SuperRAG → adapts RML parameters (top-k, search mode, reranking, compaction)
SuperMem → typed memory with decay, conflict resolution, and stability enforcement

The harness doesn’t just remember. It heals itself from failures, converges without oscillation, and ships portable optimization memos with the agent.
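As a rough illustration of the routing idea, the sketch below maps a failed execution trace to one of the four engines named above and emits a human-readable patch. Everything here is hypothetical: the failure categories, the keyword classifier, and the patch format are stand-ins invented for this example, not the SuperOpt API.

```python
# Hypothetical failure categories mapped to the engine names from the post.
ROUTES: dict[str, str] = {
    "instruction": "SuperPrompt",
    "tool_schema": "SuperReflexion",
    "retrieval": "SuperRAG",
    "memory": "SuperMem",
}


def classify_failure(trace: str) -> str:
    """Toy keyword classifier standing in for real trace diagnosis."""
    text = trace.lower()
    if "schema" in text or "invalid argument" in text:
        return "tool_schema"
    if "retriev" in text or "top-k" in text:
        return "retrieval"
    if "memory" in text or "stale" in text:
        return "memory"
    # Anything unrecognised falls back to instruction optimization.
    return "instruction"


def route(trace: str) -> tuple[str, str]:
    """Return (engine, natural-language patch) for one failed trace."""
    category = classify_failure(trace)
    engine = ROUTES[category]
    patch = f"[{engine}] Adjust the {category} configuration: {trace}"
    return engine, patch


engine, patch = route("Tool call rejected: invalid argument 'path'")
print(engine)
print(patch)
```

A real controller would diagnose traces with far more than keyword matching. The point of the sketch is only the shape of the flow: diagnose the failure, pick an engine, and produce a readable patch that can be reviewed and versioned.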

MetaHarness in Action

On challenging Aider-style coding tasks, SuperOpt lifted success rate from 90% → 100% and made execution 1.6× faster, all with the model frozen. MetaHarness makes this real and open source. It treats your entire workspace (instructions, setup scripts, RML configs, validation logic) as an optimizable harness.

It runs an outer loop:

Start with baseline workspace
Let a coding agent propose changes
Rigorously validate + evaluate
Keep best candidate with full evidence and snapshots

Works today with Codex provider.
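The four steps of the outer loop can be sketched as a simple hill climb. This is a minimal illustration under stated assumptions, not the MetaHarness code: the `Workspace` type, the `optimize` function, and the toy proposer, validator, and evaluator are all hypothetical, and a real run would call a coding agent and a real benchmark instead.

```python
from typing import Callable

Workspace = dict[str, str]  # hypothetical: file name -> file contents


def optimize(
    baseline: Workspace,
    propose: Callable[[Workspace], Workspace],
    validate: Callable[[Workspace], bool],
    evaluate: Callable[[Workspace], float],
    budget: int,
) -> tuple[Workspace, float, list[dict]]:
    """Single-candidate hill climbing: keep a candidate only if it scores higher."""
    best, best_score = baseline, evaluate(baseline)
    ledger = [{"iteration": 0, "score": best_score, "kept": True}]
    for i in range(1, budget + 1):
        candidate = propose(dict(best))  # the proposer edits a copy of the best workspace
        if not validate(candidate):
            ledger.append({"iteration": i, "score": None, "kept": False})
            continue
        score = evaluate(candidate)
        kept = score > best_score
        if kept:
            best, best_score = candidate, score
        ledger.append({"iteration": i, "score": score, "kept": kept})
    return best, best_score, ledger


# Toy stand-ins: the "agent" appends one hint per call, and the score counts hints.
HINTS = ["run tests", "read README", "pin versions"]


def make_proposer(hints: list[str]) -> Callable[[Workspace], Workspace]:
    remaining = iter(hints)

    def propose(ws: Workspace) -> Workspace:
        hint = next(remaining, "")  # proposes no change once hints run out
        ws["AGENTS.md"] = (ws["AGENTS.md"] + "\n" + hint).strip()
        return ws

    return propose


def toy_validate(ws: Workspace) -> bool:
    return len(ws["AGENTS.md"]) < 200


def toy_evaluate(ws: Workspace) -> float:
    return sum(h in ws["AGENTS.md"] for h in HINTS) / len(HINTS)


best, score, ledger = optimize(
    {"AGENTS.md": "be careful"}, make_proposer(HINTS), toy_validate, toy_evaluate, budget=5
)
print(score, [entry["kept"] for entry in ledger])
```

The ledger plays the role of the "full evidence" in the last step: every iteration is recorded whether or not its candidate was kept, so a run can be inspected afterwards.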

On Memory and Harness

Harrison’s thread and replies are full of people asking for experiential memory that learns, active forgetting, consolidation, and adaptive procedural routing. A self-healing harness delivers exactly that. Open memory gives ownership. Self-optimization gives resilience and continuous improvement.

The Future Belongs to Harness Owners Who Can Self-Optimize

Self-optimizing harnesses are the real product. In the coming months, winning teams will ship agents that auto-adapt to new models, new tools, and new realities without constant human surgery. If you’re deep in Markdown madness, building RML-powered agents, or fighting harness brittleness, try MetaHarness today. Plug in your existing memory layers and watch the optimization loop take over.

Links:

SuperOpt Paper: https://super-agentic.ai/papers/SuperOpt.pdf
MetaHarness GitHub: https://github.com/SuperagenticAI/metaharness

The era of static (even open) harnesses is ending. The era of self-healing, self-optimizing agent environments starts now. Let’s build it.

What OpenClaw Vs Anthropic Drama Taught Us: The Urgent Need for Self-Optimizing Harness Engineering

Shashi Jagtap — Mon, 06 Apr 2026 09:35:44 GMT

Recently, OpenClaw took off like one of the greatest breakthroughs in AI. People went crazy setting up OpenClaw to automate their tasks. All looked very good until recently, when Anthropic cut off subscription-based access to Claude models for OpenClaw. Anthropic changed the rules for Claude Code subscribers: third-party tools such as OpenClaw can no longer use included subscription tokens. Users must switch to pay-as-you-go API pricing or move to Anthropic’s own first-party tools. A one-month credit was offered to existing users, but the cutoff took effect on April 4, 2026. The reaction across social platforms was immediate and intense. Power users who had built complex autonomous workflows suddenly faced costs that made their setups unsustainable. Many tried swapping in GPT models, Gemini, or local open-source alternatives. The results exposed a deeper problem.

Why Didn’t OpenClaw Work on Other Models?

OpenClaw’s entire architecture had been tuned specifically for Claude. Its prompting strategies, context management, tool-calling loops, memory compaction, and orchestration logic all assumed Claude’s particular strengths. When those assumptions broke, performance collapsed. Tasks that once ran smoothly now failed or hallucinated heavily. The fragility was not a bug. It was the direct result of hyper-optimization for a single closed-source model. There are some indications of this:

OpenClaw is built with TypeScript, and Claude models are exceptionally strong at TypeScript.
Harness optimization packages and libraries do not exist in TypeScript, so once a harness is built it stays static.
OpenClaw mentions no prompt or context optimization strategies that could automatically adapt it to other models.
Responsibility appears to be handed to the model’s reasoning rather than to the harness itself, so the model takes control of the harness.

Hence, OpenClaw works very well on Claude models but performs poorly on other models, including the GPT models from OpenAI, where its author currently works.

The TypeScript versus Python Reality Check: OpenClaw vs Hermes Agents

OpenClaw was written in TypeScript which lived deep inside the Anthropic ecosystem. That choice delivered excellent results while Claude remained the top performer for coding and reasoning. The same harness felt clumsy and “off” when users tried GPT or Codex models. OpenAI’s models come from a Python-native lineage with different tool-use patterns and code-generation behaviors. Switching required far more than a simple API key change. It demanded fundamental adjustments to the harness itself.

At the same time, Hermes Agent and similar Python-based frameworks began gaining traction. Their architecture aligns more naturally with GPT workflows, persistent memory loops, and self-improving evaluation layers. Many users started calling these alternatives “it just works” options because model switching required far less rework. The contrast highlighted a structural tension inside OpenAI itself. The company had acquired OpenClaw’s TypeScript codebase through the creator’s hiring while also owning deep Python tooling via other moves. Reconciling those two worlds became an internal priority. OpenAI’s acquisition of Astral, the Python tooling startup, suggests that its focus is on the Python ecosystem rather than TypeScript.

The Open-Source versus Closed-Source Debate Ignites

The pricing change triggered a loud and immediate debate on social media about open source versus closed source. Many users argued that the moment proved why open source must win. They urged OpenClaw to focus on local models and fully self-hosted setups. One post captured the sentiment directly: “If we ‘never bet against open source’, then it means OpenClaw should simply be used with local models.” Others declared that Anthropic’s move had accidentally accelerated hybrid architectures where cheap local models handle execution while a stronger cloud model orchestrates. Several users announced they would now invest time in local-model hosting rather than pay Anthropic’s new rates.

The debate carries an obvious contradiction. OpenClaw’s creator, Peter Steinberger, recently joined OpenAI. Multiple posts note that he had publicly criticized Anthropic and accused the company of copying features from OpenClaw before restricting access. One widely shared thread described the timing as “Anthropic just cut off Claude Code subscribers from third-party tools like OpenClaw. No migration path. Pay-as-you-go or nothing. OpenClaw’s creator joined OpenAI. Anthropic moved on pricing the same week. Timing writes its own story.” Another highlighted Steinberger’s earlier statements that Anthropic had copied features into its closed tool before locking out the open-source alternative.

Anthropic, OpenAI, OpenClaw

For the open-source community that had embraced OpenClaw, the situation feels equally contradictory. The framework is positioned as open and customizable, yet its creator now works at a leading closed-source lab. Users who favor local models still face a harsh reality: today’s best open-source models lag significantly in reliable tool calling, long-horizon reasoning, and consistent multi-step orchestration. Hybrid experiments are growing, but pure local performance remains a work in progress. Meanwhile, making GPT models perform well inside OpenClaw requires substantial engineering effort. Some observers suggest the entire harness might eventually need a port to Python to align with OpenAI’s strengths. The debate on X has therefore split into two camps: one demanding faster open-source integration, the other acknowledging that closed-source models still deliver the highest immediate capability.

Anthropic’s defenders point out that the company is simply enforcing its terms and protecting its infrastructure from subsidized usage patterns that grew faster than expected. They are not blocking API access entirely. The change forces third-party tools to pay realistic rates. Critics counter that the timing and the simultaneous rollout of native Claude Code features (recurring prompts, scheduled tasks, persistent memory, remote control) look like a deliberate shift toward first-party lock-in. Either way, the episode has made one fact undeniable: relying on a single provider’s goodwill is risky when your entire product depends on that provider’s model.

Has Anthropic Done Anything Wrong? Or Is Peter Playing a Double Game?

Has Anthropic done anything wrong in this situation? I do not think so. They are simply trying to stop the abuse of their most powerful models, which they produce for the entire world. They did not block OpenClaw directly. People can still use OpenClaw with Claude models by switching to the official API and paying the standard usage rates. Previously, many users were heavily abusing the fixed-price Claude subscription to power intensive agent workflows in OpenClaw, far beyond what the subscription was designed for. This became especially relevant as OpenClaw’s author, Peter Steinberger, joined OpenAI, a direct competitor, and began actively promoting GPT and Codex models while making critical comments about Anthropic and Claude Code. From Anthropic’s perspective, this was a reasonable competitive response to protect their business model. Steinberger first used Claude Code and then, by his own account, switched to Codex, immediately after Anthropic blocked OpenCode and ClawdBot, the earlier incarnation of OpenClaw. He was also running a meetup on Claude Code, which shows he was an active Claude Code user, so his switch to Codex is understandable after Anthropic blocked access to ClawdBot.

This situation feels like a double game: keeping OpenClaw positioned as open source while aligning closely with a closed-source lab. If the goal was true independence, staying neutral rather than joining OpenAI would have been a clearer path. In the end, Anthropic simply played their game. Claims that they “blocked” access seem designed more to generate sympathy and attention than to reflect the reality that API access remains fully available, at market rates. I hope this move encourages healthier, more sustainable practices across the agent ecosystem.

OpenClaw’s Fork in the Road: Port OpenClaw to Python

OpenClaw now faces two clear but difficult paths. The first is deep integration with GPT and other OpenAI models. Steinberger’s new role inside OpenAI may accelerate that work, but it still requires rewriting large sections of the TypeScript harness. The second path is full support for local and open-source models. That route demands solving the current gaps in tool-calling reliability and reasoning depth. Neither option is trivial, and the window for execution is narrow. Competing agent frameworks continue to ship new capabilities every week.

One clear option is to port OpenClaw to Python and make it compatible with the OpenAI ecosystem. A Python port could benefit from tools like DSPy, GEPA, ACE, metaharness, and other recent breakthroughs from the AI/ML world. It would also align with OpenAI’s Python ecosystem and its vision of continuing to ship Python and Rust based tooling.

The other option is to bet on open-source models catching up in reasoning and tool calling. That may be getting closer, but a static harness still won’t work across every model family and provider. Either way, a self-optimizing harness is a need, not an option.

How Is OpenClaw Getting Fixed? Recent Reactions

In response to the widespread user frustration, Peter Steinberger has been actively working on GPT integration. On April 5, 2026, he posted that he “made GPT really good today and switched.” He noted that Claude Opus is still funnier but that GPT is now more reliable. In the same conversation he confirmed he had improved the harness, prompts, message tracking, and parsing logic. He encouraged users to test the changes immediately by running openclaw update --channel dev and asked for feedback directly in the thread. When users reported issues (such as execution failures or case-sensitivity problems with file systems), Steinberger responded quickly with further tweaks and debugging suggestions like enabling /verbose mode. He also addressed personality complaints by stating he had already fixed them.

These are meaningful short-term improvements. Users who update to the dev channel are already seeing better GPT performance, and the community appreciates the rapid iteration.

However, these fixes are not enough. Steinberger is currently optimizing the harness and prompts specifically for the current GPT model. When the next GPT release or a new open-source model is used inside OpenClaw, the entire harness will likely break again. Simply patching the prompt and tweaking the orchestration layer for one model at a time is not a sustainable solution. What OpenClaw needs is a truly self-optimizing harness that automatically detects model capabilities and adapts its prompting, context handling, tool routing, and evaluation logic without manual intervention. The current work is a patch, not a fix. Users should be aware that this can break anytime a new model releases. Harness optimization remains the real key.
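To make the idea of capability detection concrete, here is a minimal Python sketch. It is not OpenClaw's code and does not reflect any real OpenClaw API: the names ModelFn, HarnessProfile, supports_json_tool_calls, and select_profile are all hypothetical, and the two stub models stand in for real providers. The sketch probes a model once to see whether it can emit a parseable tool call, then picks harness settings from the result instead of hard-coding them per model.

```python
import json
from dataclasses import dataclass
from typing import Callable

# Hypothetical type: any function that takes a prompt and returns model text.
ModelFn = Callable[[str], str]


@dataclass(frozen=True)
class HarnessProfile:
    """Harness settings chosen per model (illustrative fields only)."""
    tool_call_format: str   # "json" for structured calls, "text" for parsed prose
    max_steps: int          # how long a chain of steps the harness will allow
    verify_each_step: bool  # whether to re-check every step before continuing


PROBE_PROMPT = 'Reply with only this JSON: {"tool": "echo", "args": {"text": "hi"}}'


def supports_json_tool_calls(model: ModelFn) -> bool:
    """Probe the model once and check that the reply parses as a tool call."""
    try:
        reply = json.loads(model(PROBE_PROMPT))
    except (json.JSONDecodeError, TypeError):
        return False
    return isinstance(reply, dict) and reply.get("tool") == "echo"


def select_profile(model: ModelFn) -> HarnessProfile:
    """Pick conservative settings when the probe fails, looser ones when it passes."""
    if supports_json_tool_calls(model):
        return HarnessProfile("json", max_steps=20, verify_each_step=False)
    return HarnessProfile("text", max_steps=8, verify_each_step=True)


# Stub models used only for this example.
def strong_model(prompt: str) -> str:
    return '{"tool": "echo", "args": {"text": "hi"}}'


def weak_model(prompt: str) -> str:
    return "Sure! I will call the echo tool with hi."


if __name__ == "__main__":
    print(select_profile(strong_model))
    print(select_profile(weak_model))
```

A real harness would need many probes (long context, multi-step plans, error recovery) and would cache the results, but the shape is the same: measure the model first, then configure the harness from the measurement.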

Harness Engineering Is the Real Moat

The clearest lesson from this entire episode is that harness engineering matters more than any single model. A harness is not just glue code. It is the orchestration layer that turns raw model intelligence into reliable, repeatable work: context management, evaluation loops, memory compaction, tool routing, error recovery, and lifecycle handling. When that layer is tuned exclusively for one model’s quirks, the entire system becomes brittle the moment that model’s economics or availability change.

The winning architectures of the next twelve months will treat models as swappable engines rather than fixed chassis. They will include pluggable adapters that auto-detect model capabilities and adjust prompting, caching, and evaluation automatically. They will separate generator and evaluator roles. They will manage context intelligently instead of blindly compressing it. Research into meta-harnesses and natural-language agent frameworks is already moving in exactly this direction. Users on X repeatedly emphasize the same point: the harness itself is the product.
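The two ideas in this paragraph, swappable models and separate generator and evaluator roles, can be sketched in a few lines of Python. Everything here is an assumption made for illustration: ModelAdapter, FunctionAdapter, and run_with_review are hypothetical names, and the stub functions replace real model calls. The loop only ever talks to the adapter interface, so either role can be pointed at a different model without touching the core logic.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


class ModelAdapter(Protocol):
    """Hypothetical interface every model backend must satisfy."""
    name: str

    def complete(self, prompt: str) -> str: ...


@dataclass
class FunctionAdapter:
    """Wraps any prompt -> text function so it can be plugged into the harness."""
    name: str
    fn: Callable[[str], str]

    def complete(self, prompt: str) -> str:
        return self.fn(prompt)


def run_with_review(task: str, generator: ModelAdapter,
                    evaluator: ModelAdapter, max_rounds: int = 3) -> str:
    """Generate a draft, let a separate evaluator judge it, retry with feedback."""
    feedback = ""
    draft = ""
    for _ in range(max_rounds):
        draft = generator.complete(f"Task: {task}\nFeedback: {feedback}")
        verdict = evaluator.complete(
            f"Task: {task}\nDraft: {draft}\nReply PASS or give feedback."
        )
        if verdict.strip().upper().startswith("PASS"):
            return draft
        feedback = verdict  # carry the evaluator's notes into the next round
    return draft  # best effort after max_rounds


# Stub backends used only for this example.
def stub_generator(prompt: str) -> str:
    # First round has empty feedback; later rounds produce a revised draft.
    return "draft v1" if prompt.endswith("Feedback: ") else "draft v2"


def stub_evaluator(prompt: str) -> str:
    return "PASS" if "draft v2" in prompt else "add error handling"


generator = FunctionAdapter("gen-stub", stub_generator)
evaluator = FunctionAdapter("eval-stub", stub_evaluator)

if __name__ == "__main__":
    print(run_with_review("write a parser", generator, evaluator))
```

Swapping a model then means constructing a different adapter, not rewriting run_with_review, which is the property the paragraph above argues for.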

That insight extends beyond any one framework. Builders should not tie their workflows permanently to OpenClaw, Hermes, LangChain, or any other single tool. Instead they should learn harness engineering fundamentals so they can assemble and optimize their own pipelines. When a new model arrives with better reasoning, faster tool calling, or longer context, the harness should adopt those capabilities within days, not months. The ability to switch without rewriting core logic is the new competitive advantage.

OpenClaw did not fail in optimization because it was poorly built. It succeeded so well with Claude that it became dangerously dependent on one model family. That dependency just became expensive. The next generation of agent tools will avoid the same trap. They will be ready for whatever model, Claude, GPT, Gemini, Llama, or the next surprise, drops next month. Because in the agent era the harness is not infrastructure. The harness is the product. And the best harnesses will never bet on any single king. They will be engineered to welcome the next one.

If you are interested in Harness Engineering, sign up for the event “Harness Engineering: State of the Art in Agent Harnesses” in San Francisco, which is dedicated to Harness Engineering, and stay up to date. Speakers will be announced soon. Stay tuned.
Harness Engineering is here to stay. Keep Harness Engineering!