Microsoft Field Engineers Built a Six-Agent Research Pipeline in VS Code That Fact-Checks Its Own Output

A customer deploys AKS in a regulated environment, hits an issue during node bootstrapping, and wants to know exactly what happens when a node joins the cluster. The question sounds simple. The answer is spread across the AgentBaker source code, the cloud-provider-azure module, a Microsoft Learn article, three abstraction levels above what actually runs on the node, and the institutional knowledge of a teammate who may or may not be online.
That’s the daily reality for Microsoft’s Global Black Belts — field engineers handling deep technical questions about Azure Kubernetes Service (AKS) and Azure Red Hat OpenShift (ARO). Two of them, Diego Casati and Ray Kao, built a system that does what they do: retrieve, correlate, verify, and write up the answer. They call it Project Nighthawk.
What Nighthawk Does
Nighthawk is a multi-agent research system built inside VS Code with GitHub Copilot. You type a command like /Nighthawk how does AKS implement KMS encryption with customer-managed keys? and it produces a fact-checked, source-cited technical report in markdown. Not a summary of what a language model remembers from training data. An actual investigation — source code read, official documentation consulted, claims verified against cited sources, findings written up with Mermaid diagrams where they add clarity.
The kind of report a senior engineer would produce after two hours of focused research, delivered in a fraction of the time.
Behind that single command is a six-agent pipeline: Orchestrator coordinates the workflow. Classifier determines which service the question targets (AKS or ARO) and what type of question it is (architecture, bug, or guidance). Researcher agents search locally cloned repositories and cross-reference Microsoft Learn documentation through the MCP Server. Synthesizer writes the structured report. FactChecker validates every claim against the cited sources.
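The blog post doesn't publish Nighthawk's internals, but the staged handoff it describes can be sketched in a few lines. Everything below is illustrative: the class names, fields, and routing heuristics are assumptions, not Nighthawk's actual API.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the six-agent pipeline described above.
# All names and heuristics are illustrative, not Nighthawk's real code.

@dataclass
class Classification:
    service: str        # "AKS" or "ARO"
    question_type: str  # "architecture", "bug", or "guidance"

@dataclass
class Report:
    body: str
    claims: list = field(default_factory=list)

def classify(question: str) -> Classification:
    # Toy routing stand-in for the Classifier agent.
    q = question.lower()
    service = "ARO" if ("openshift" in q or "aro" in q) else "AKS"
    qtype = "bug" if "error" in q else "architecture"
    return Classification(service, qtype)

def research(question: str, c: Classification) -> list:
    # In Nighthawk this searches freshly pulled repos and Microsoft Learn via MCP.
    return [f"{c.service} finding for: {question}"]

def synthesize(findings: list) -> Report:
    # The Synthesizer structures findings into the report template.
    return Report(body="\n".join(findings), claims=findings)

def fact_check(report: Report) -> dict:
    # Placeholder: a real checker re-reads each cited source per claim.
    return {"verified": len(report.claims), "flagged": 0}

def orchestrate(question: str) -> dict:
    # The Orchestrator chains the stages through explicit contracts.
    c = classify(question)
    report = synthesize(research(question, c))
    return {"report": report, "fact_check": fact_check(report)}
```

The point of the sketch is the shape, not the logic: each stage consumes and produces a typed artifact, so no agent depends on another's internal reasoning.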
The output lands in a notes/ directory as a markdown file with a consistent structure: a TL;DR with the direct answer, a technical deep dive, key findings, reference tables with Microsoft Learn links and GitHub file paths, and a fact-check summary showing how many claims were verified versus flagged.
Why Grounding Matters More Than Model Intelligence
The project’s core insight — and it generalizes well beyond Azure field engineering — is that the problem with asking AI to research technical topics isn’t model capability. It’s grounding.
LLMs synthesize patterns from training data. Azure is a moving target. A model trained six months ago may confidently describe a code path that was refactored in the last release. For general background, that’s usually fine. For precise, version-specific answers that field engineering demands, it’s not.
Nighthawk’s solution: researchers operate against locally cloned repositories pulled fresh before each run. The model works against the actual current state of the codebase. Source code is one input. Official documentation via the Microsoft Learn MCP server is another. Release notes are another. The researcher correlates all of them and surfaces any conflicts that exist.
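A refresh-then-read loop like this needs no embedding pipeline at all. The sketch below is a minimal stand-in, assuming a hypothetical layout of locally cloned Go repositories (AgentBaker and cloud-provider-azure are Go projects); the function names are invented for illustration.

```python
import subprocess
from pathlib import Path

def refresh_repo(repo_path: str) -> None:
    # Pull latest before every run so the agent reads the current
    # state of the codebase, never a stale clone. (Hypothetical helper.)
    subprocess.run(["git", "-C", repo_path, "pull", "--ff-only"], check=True)

def search_repo(repo_path: str, term: str) -> list:
    # Grep-style scan over Go sources: return file:line hits so every
    # finding carries a citable path, as Nighthawk's reports require.
    hits = []
    for f in Path(repo_path).rglob("*.go"):
        for i, line in enumerate(f.read_text(errors="ignore").splitlines(), 1):
            if term.lower() in line.lower():
                hits.append(f"{f}:{i}")
    return hits
```

Because the agent reads real files at real paths, every claim in the final report can point back to an exact file and line rather than to a vector-store neighbor.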
This is the same grounding principle behind the Azure DevOps Remote MCP Server — giving agents structured access to live data rather than relying on recall of training data.
The Architecture Decisions That Matter
Several design choices in Nighthawk apply to any team building internal agent systems.
Agent separation over monolithic prompts. Six agents with distinct responsibilities rather than one agent trying to do everything. The Classifier doesn’t research. The Researcher doesn’t write reports. The FactChecker doesn’t synthesize. Each agent is scoped, and the quality of the final output depends on that separation. This follows the Agent Handoff Pattern from Azure’s Architecture Center — in which specialized agents complete distinct tasks and pass results through well-defined contracts.
VS Code agent skills for workflow knowledge. Rather than encoding operational knowledge in system prompts, Nighthawk uses markdown skill files that agents read at the start of each run. The Nighthawk-LocalRepos skill tells researchers which repositories exist, what each covers, and why pulling the latest is mandatory. The Nighthawk-ReportTemplates skill gives the Synthesizer the exact structure for each report type. Skills load on demand rather than consuming context window space permanently — the same pattern VS Code agent plugins use for distributing tool configurations.
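The mechanics of on-demand skill loading are simple enough to show directly. This is a generic sketch, not Nighthawk's implementation; the directory layout and function names are assumptions.

```python
from pathlib import Path

def load_skill(skills_dir: str, name: str) -> str:
    # Read a markdown skill file only when a run needs it, instead of
    # baking its contents into every system prompt permanently.
    return (Path(skills_dir) / f"{name}.md").read_text(encoding="utf-8")

def build_prompt(base_prompt: str, skills_dir: str, skill_names: list) -> str:
    # Append only the skills this agent's stage requires, keeping the
    # context window free for evidence rather than standing instructions.
    sections = [load_skill(skills_dir, n) for n in skill_names]
    return "\n\n".join([base_prompt, *sections])
```

The payoff is that workflow knowledge lives in version-controlled markdown the whole team can edit, while each agent's prompt stays small.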
Fact-checking as a separate pipeline stage. The FactChecker reads the finished report and validates each claim against cited sources. Verified claims get a checkmark. Unverifiable claims get flagged with a count, so the engineer sharing the report knows exactly where to look. The agent that writes the report isn’t the one that validates it.
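The verify-and-tally step described above reduces to a small function. This is a hedged sketch: `verify` stands in for whatever source-reading logic the real FactChecker uses, and the output format is assumed, not taken from Nighthawk.

```python
def fact_check_summary(claims, verify):
    # Validate each claim with `verify` (a callable that returns True when
    # a cited source supports the claim) and tally the results, so flagged
    # claims surface as a count rather than a hidden error.
    verified, flagged = [], []
    for claim in claims:
        (verified if verify(claim) else flagged).append(claim)
    lines = [f"[verified] {c}" for c in verified]
    lines += [f"[flagged] {c}" for c in flagged]
    lines.append(f"{len(verified)} verified, {len(flagged)} flagged")
    return "\n".join(lines)
```

Keeping validation as its own function, fed by a different agent than the one that wrote the claims, is what makes the flagged count trustworthy.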
“Project Nighthawk makes a structural argument for how agent systems earn reliability. Separating classification, research, synthesis, and fact-checking into distinct agents creates independent accountability at each stage, treating verification as a first-class pipeline stage rather than a quality assumption baked into the model,” according to Mitch Ashley, VP and practice lead for software lifecycle engineering at The Futurum Group.
“Teams deploying agents against fast-moving codebases face the same grounding problem. Agents that synthesize from training data produce confident answers that age poorly. Grounding against current source files and validating output through a dedicated stage positions reliability as an architectural commitment that cannot be deferred.”
Why This Matters for DevOps
Project Nighthawk is narrowly scoped — AKS and ARO questions, two field engineers. But the architecture is general. Any team that answers deep technical questions against a codebase that changes regularly faces the same problem: knowledge scattered across repos, documentation, changelogs, and people’s heads. The cost of assembling a reliable answer is high. And the answer usually lives in a Teams thread that ages out in a week.
The six-agent pipeline addresses each stage of research separately. Classification routes to the right knowledge domain. Research retrieves evidence from current sources. Synthesis structures findings consistently. Fact-checking validates output before it’s shared. Each stage can fail independently, and failures are visible — a flagged claim is a signal, not a hidden error.
The grounding architecture — locally cloned repos pulled before each run, plus MCP-mediated access to documentation — is a practical alternative to RAG pipelines for fast-moving codebases. No embedding pipeline to maintain, no vector database to keep in sync. The agent reads the actual files. The tradeoff is that it works best for repositories small enough to clone locally, but for focused domains, that’s manageable.
For DevOps teams building internal knowledge systems, Nighthawk demonstrates a pattern worth studying: Don’t ask the model what it knows. Direct it to look at what’s actually there. Structure the output consistently. And validate the result before it reaches the customer.
Project Nighthawk is detailed in a blog post on the All Things Azure DevBlog.
from DevOps.com https://ift.tt/hpy8jlq