Ornith Models Automate Agentic Coding With Self-Scaffolding

Ornith, a new family of open source LLMs from the DeepReinforce research collective, takes a novel approach to coding and debugging tasks: It generates a structured instruction set – a scaffold – that the user’s harness uses to create an agent to complete the job.

Available in four variants, the Ornith family was trained to work comfortably with complex software repositories and to undertake complicated long-horizon jobs. Sure, LLMs can do these tasks now – until the job gets too complex. Ornith’s self-generated scaffolding is designed to ensure that it doesn’t forget the plot along the way.

“The model continuously improves not only its code generation abilities but also the orchestration strategy used to solve software engineering problems,” wrote AI tutorial engineer Mehul Gupta, in an introductory post.

Deep Reinforcement Expansion Pack

Ornith reads the user’s instruction, but instead of executing it directly, it builds a scaffold, a learnable object. The scaffold serves as a place where Ornith can design – and refine – the architecture for the job.

According to Gupta, the scaffold is where the LLM can detail the reasoning sequences, memory organization, debugging strategy, tool invocation order and execution planning. The user’s harness then interprets the scaffold to generate an agent to execute the task.

When the job is finished, the scaffold is deleted. When a new task comes up, Ornith builds a fresh scaffold to execute that job.
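The lifecycle described above (build a scaffold, let the harness interpret it, discard it when the job ends) can be sketched in a few lines of Python. This is a minimal illustration only: the article does not publish Ornith's scaffold format, so the `Scaffold` fields, `build_scaffold`, `run_with_harness`, `solve` and the tool names are all hypothetical stand-ins, and the scaffold here is hard-coded rather than generated by a model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Scaffold:
    """Hypothetical per-task plan; fields mirror the elements Gupta lists."""
    task: str
    reasoning_steps: List[str] = field(default_factory=list)
    tool_order: List[str] = field(default_factory=list)
    debugging_strategy: str = ""
    memory: Dict[str, str] = field(default_factory=dict)


def build_scaffold(task: str) -> Scaffold:
    """Stand-in for the model's scaffold generation (hard-coded, an assumption)."""
    return Scaffold(
        task=task,
        reasoning_steps=["locate failing test", "patch code", "re-run tests"],
        tool_order=["search", "edit", "test"],
        debugging_strategy="re-run the failing test after every edit",
    )


def run_with_harness(scaffold: Scaffold,
                     tools: Dict[str, Callable[[str], str]]) -> List[str]:
    """Interpret the scaffold: call each tool in the planned order."""
    log = []
    for name in scaffold.tool_order:
        result = tools[name](scaffold.task)
        scaffold.memory[name] = result  # working memory lives inside the scaffold
        log.append(f"{name}: {result}")
    return log


def solve(task: str, tools: Dict[str, Callable[[str], str]]) -> List[str]:
    """Build a fresh scaffold for this task, run it, then let it go."""
    scaffold = build_scaffold(task)
    return run_with_harness(scaffold, tools)
    # The scaffold goes out of scope here; the next task gets a new one.


if __name__ == "__main__":
    demo_tools = {
        "search": lambda task: "found",
        "edit": lambda task: "patched",
        "test": lambda task: "passed",
    }
    for line in solve("fix failing unit test", demo_tools):
        print(line)
```

Keeping the plan and its working memory inside one throwaway object is what makes each task start from a clean slate in this sketch.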

“By jointly optimizing the scaffold and the resulting solution, the model can discover better search trajectories and generate higher-quality solutions,” the researchers state in a post.

Ornith builds the scaffolding from a set of rules the model developed during training. The models were built through an exhaustive self-learning process that used deep reinforcement learning techniques to iterate through the possible ways of addressing an issue.

Four Models

Ornith’s four variants are: 9B Dense, 31B Dense, 35B MoE and 397B MoE, where the “B” denotes billions of parameters. The “Dense” models activate every parameter for each input, whereas the MoE (Mixture of Experts) models activate only the subset of parameters, grouped into “experts,” that is most relevant to the input, with individual experts able to specialize in particular functions.
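To make the Dense versus MoE distinction concrete, here is a minimal sketch of top-k expert routing, a common MoE design. The article does not say how Ornith's MoE variants route inputs, so the routing scheme, the function names and the example numbers (eight experts, two active) are illustrative assumptions, not a description of Ornith.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Convert raw router scores into probabilities."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_scores: List[float], k: int) -> List[Tuple[int, float]]:
    """Pick the k highest-scoring experts and renormalize their weights."""
    probs = softmax(router_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def active_fraction(num_experts: int, k: int) -> float:
    """Share of expert parameters used per input (a dense model uses all of them)."""
    return k / num_experts


if __name__ == "__main__":
    # Hypothetical router scores for four experts; only two are activated.
    for expert, weight in route_top_k([0.1, 2.0, -1.0, 1.5], k=2):
        print(f"expert {expert}: weight {weight:.3f}")
    print(f"active share with 8 experts, top-2: {active_fraction(8, 2):.0%}")
```

In a dense model the equivalent of `active_fraction` is always 1.0, which is why an MoE model can carry far more total parameters for a similar per-input compute cost.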

The variants are built atop the open source Gemma 4 and Qwen 3.5 models, allowing the researchers to layer coding-specific deep RL rules over those models’ inherent fluency in language and world knowledge.

The dense models are best suited for running on local hardware. Ideal for a laptop, 9B Dense can write small scripts and execute various single-file cleanup tasks, whereas the 31B Dense requires a full workstation with up to 48GB of VRAM, but can internalize a full view of a complicated multi-file repository for tougher problems.

The MoE variants are best run in the cloud. The 35B MoE is perhaps best suited for quick continuous integration patching and code review. The 397B MoE is the flagship model, a competitor to Opus 4.7, in the organization’s estimation. This behemoth requires a cluster of GPUs to run smoothly, and can tackle the hardest coding problems.
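The hardware guidance above can be sanity-checked with back-of-the-envelope arithmetic: the memory needed for the weights alone is roughly the parameter count times the bytes per parameter. The sketch below assumes 1 GB = 10^9 bytes and ignores the KV cache, activations and runtime overhead. The article does not state which numeric precision Ornith ships in, so the bit widths are assumptions.

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_param / 8


def fits(params_billion: float, bits_per_param: int, vram_gb: float) -> bool:
    """True if the weights alone fit in the given VRAM (no headroom counted)."""
    return weight_memory_gb(params_billion, bits_per_param) <= vram_gb


if __name__ == "__main__":
    # Parameter counts come from the variant names; precisions are hypothetical.
    for name, size in [("9B Dense", 9), ("31B Dense", 31),
                       ("35B MoE", 35), ("397B MoE", 397)]:
        row = ", ".join(
            f"{bits}-bit: {weight_memory_gb(size, bits):.1f} GB"
            for bits in (16, 8, 4)
        )
        print(f"{name}: {row}")
    print("31B at 8-bit fits in 48 GB:", fits(31, 8, 48))
```

Under these assumptions, the 31B Dense weights take about 62 GB at 16-bit precision but about 31 GB at 8-bit, which is consistent with a 48GB workstation only if the model is quantized. An MoE model typically still needs all of its parameters resident in memory even though only some are active per input, which is why the 397B variant calls for a multi-GPU cluster.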

Killer Performance

With this diversity of models, Ornith’s performance metrics are “just killing it all over the place,” with impressive marks across small, middle and large LLM categories, observed the Hyderabad, Telangana-based Data Science in Your Pocket YouTube channel. This is “a breakthrough … one of a kind model,” they noted.

In company tests, Ornith-1.0-397B outperformed Claude Opus 4.7 on Terminal-Bench 2.1, a benchmark for LLMs in terminal environments, scoring 77.5 to Claude’s 70.3.

Likewise, Ornith-1.0-35B significantly outperformed smaller models, including Qwen 3.5 (9 billion parameters) and Gemma 4 (12 billion parameters). It even rivaled the 31-billion-parameter Gemma 4 model.

from DevOps.com https://ift.tt/HLWShxE

