Co-Developing an AI Native Observability Platform 


As AI capabilities continue to evolve, AI is becoming central to managing the growing complexity of distributed, hybrid enterprise environments, enabling more effective analysis, correlation, and automation across interconnected systems.

Traditional infrastructure monitoring approaches, and network monitoring in particular, are often built around siloed tools and static thresholds, and they struggle to keep pace with the scale, velocity, and interdependencies of modern systems. The increasingly blurred boundaries between network, application, and infrastructure domains make it harder to isolate root causes and maintain operational resilience. In this context, AIOps platforms have emerged as one response to the growing need for integrated observability, automation, and data-driven decision-making.

At AI Field Day, Selector AI presented an AIOps platform that can be considered a foundation for co-creating more adaptive, data-driven network operations. Rather than positioning the platform purely as a product choice, Selector takes a SaaS approach in which professional services are part of the offering alongside the product features, and it encourages customers to co-develop their own platform instance.

Demonstrations at AI Field Day highlighted full-stack observability with a data-centric approach, where data becomes the core of the stack. Selector’s strength lies in this data-centric foundation: it ingests diverse, multi-domain sources (metrics, logs, configs, alerts, and topology) into a unified analytics layer. Unlike model-first tools, it prioritizes correlating raw telemetry with ML before layering AI on top, creating a “single source of truth” that aims to cut alert fatigue and support hybrid and cloud environments without siloed dashboards. A unified data approach, combined with co-development of the specific platform instance, provides a more deterministic way of visualizing a problem; paired with causal analysis, this can help identify root causes more efficiently.
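To make the idea of correlating raw telemetry concrete, here is a minimal sketch of grouping alerts into incidents by time proximity and topology adjacency. It is an illustration only, not Selector's implementation: the `Alert` type, the `correlate` function, the 60-second window, and the sample devices are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    device: str
    timestamp: float  # seconds; any consistent clock works for this sketch
    message: str


def correlate(alerts, topology, window_s=60.0):
    """Group alerts into incidents.

    An alert joins an existing incident when it arrives within `window_s`
    seconds of that incident's most recent alert AND its device is the same
    as, or a topology neighbor of, a device already in the incident.
    """
    incidents = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        for incident in incidents:
            last = incident[-1]
            in_window = alert.timestamp - last.timestamp <= window_s
            related = any(
                alert.device == a.device
                or alert.device in topology.get(a.device, set())
                for a in incident
            )
            if in_window and related:
                incident.append(alert)
                break
        else:
            # No existing incident matched, so start a new one.
            incidents.append([alert])
    return incidents


# Hypothetical topology: device -> set of directly connected devices.
TOPOLOGY = {
    "core-1": {"edge-1", "edge-2"},
    "edge-1": {"core-1"},
    "edge-2": {"core-1"},
    "edge-9": set(),
}

# Hypothetical alert stream.
ALERTS = [
    Alert("core-1", 100.0, "interface down"),
    Alert("edge-1", 105.0, "BGP session lost"),
    Alert("edge-2", 110.0, "packet loss"),
    Alert("edge-9", 112.0, "high CPU"),
]

incidents = correlate(ALERTS, TOPOLOGY)
print(len(ALERTS), "alerts ->", len(incidents), "incidents")
```

In this toy data, the three alerts on connected devices collapse into one incident while the unrelated `edge-9` alert stays separate, which is the kind of reduction a unified data layer makes possible. A production system would likely use richer signals than a fixed time window.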

Another strategic feature embedded in the platform is the Network Language Model, a model fine-tuned on vast amounts of networking telemetry data that bridges natural-language queries and complex operations tasks. It understands domain-specific terms (e.g., interface connectivity, routing paths) and powers chat interactions in Slack and Teams. This capability gives Selector AI an advantage in moving beyond basic observability toward AI agent workflows that enable autonomous, explainable network operations through its agent framework. These agents use Retrieval-Augmented Generation (RAG) with the Network Language Model to process unified telemetry and then trigger actions.
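The retrieval step of a RAG workflow can be sketched in a few lines. The example below is a toy illustration under stated assumptions, not Selector's agent framework or the Network Language Model: it ranks telemetry records by simple token overlap with the question and assembles a prompt, and it stops short of calling any model. The functions `tokenize`, `retrieve`, and `build_prompt` and the sample records are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into a set of simple tokens."""
    return set(re.findall(r"[a-z0-9/.-]+", text.lower()))


def retrieve(query, records, k=2):
    """Return up to k records that share the most tokens with the query."""
    q = tokenize(query)
    ranked = sorted(records, key=lambda r: len(q & tokenize(r)), reverse=True)
    # Drop records with no overlap at all so irrelevant telemetry is excluded.
    return [r for r in ranked[:k] if q & tokenize(r)]


def build_prompt(query, context):
    """Assemble the text that would be sent to a language model."""
    lines = ["Answer using only the telemetry below.", "Telemetry:"]
    lines += [f"- {c}" for c in context]
    lines.append(f"Question: {query}")
    return "\n".join(lines)


# Hypothetical telemetry records, standing in for the unified data layer.
RECORDS = [
    "edge-1 interface ge-0/0/1 down since 10:02",
    "core-1 bgp session to edge-1 flapping",
    "edge-9 cpu utilization 40 percent",
]

question = "why is the interface on edge-1 down"
context = retrieve(question, RECORDS)
prompt = build_prompt(question, context)
print(prompt)
```

A real deployment would typically replace the token-overlap ranking with embedding-based search and pass the prompt to a model; the point here is only that the answer is grounded in retrieved telemetry rather than in the model's memory alone.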

By comparison, building an in-house platform offers customization but typically requires significant investment, often on the order of 18 to 24 months, multimillion-dollar budgets, and dedicated engineering teams. In addition, ongoing maintenance can increase as telemetry volumes grow and machine learning models require continuous tuning. Over time, internally developed systems may accumulate technical debt, particularly if they struggle to keep pace with evolving data and operational complexity. In contrast, purchasing a platform such as Selector, where organizations can engage in a co-development approach, may reduce initial development effort and accelerate deployment, with integrated capabilities like cross-domain correlation, incident summarization, and extensibility through a partner ecosystem. Another highlight is that the Selector instance is per customer, which avoids additional overhead.

The role of AI in operations is also evolving. Rather than only optimizing existing functions such as site reliability, the approach tends to shift the human focus toward higher-level validation and decision-making. This creates a collaborative model where human expertise, operational data, and machine intelligence reinforce each other.

Adoption of such platforms often benefits from a phased approach. Initial efforts may focus on a limited proof of value, targeting a small number of critical services to measure improvements in alert reduction and incident response times. Subsequent phases can expand telemetry ingestion, introduce agentic workflows, and automate routine operational tasks, supported by cross-functional governance structures. Over time, organizations may extend capabilities toward predictive operations, capacity planning, and broader automation, while continuously evaluating outcomes against defined performance and cost metrics.
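A proof of value is easier to judge when the success metrics are computed the same way before and after the pilot. The sketch below shows one way to calculate alert reduction and the change in mean resolution time. All figures are made-up illustrative numbers, and the `percent_reduction` helper and dictionary keys are hypothetical.

```python
from statistics import mean


def percent_reduction(before, after):
    """Percentage by which `after` is lower than the `before` baseline."""
    if before <= 0:
        raise ValueError("baseline must be positive")
    return (before - after) / before * 100.0


# Illustrative numbers only; replace with measurements from your own services.
baseline = {"alerts_per_week": 1200, "resolution_minutes": [90, 120, 60]}
pilot = {"alerts_per_week": 300, "resolution_minutes": [45, 60, 30]}

alert_reduction = percent_reduction(
    baseline["alerts_per_week"], pilot["alerts_per_week"]
)
mttr_before = mean(baseline["resolution_minutes"])
mttr_after = mean(pilot["resolution_minutes"])
mttr_reduction = percent_reduction(mttr_before, mttr_after)

print(f"Alert volume reduced by {alert_reduction:.1f}%")
print(f"Mean resolution time: {mttr_before} -> {mttr_after} minutes "
      f"({mttr_reduction:.1f}% lower)")
```

Fixing these definitions up front gives later phases a consistent baseline to evaluate against the performance and cost metrics mentioned above.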

Lastly, co-creation allows AI models and analytics to be customized to fit unique customer needs. According to Selector’s blog, this “mass customization” enables teams to create specific, actionable insights rather than relying on generic, “one-size-fits-all” heuristics. These elements combine to support agentic, self-healing networks, aligning with the AI Field Day 8 themes of production-scale inference and infrastructure evolution.

From a strategic perspective, platforms like Selector can be viewed less as standalone products and more as enablers of operational evolution. Their long-term value depends on how effectively organizations integrate them into their workflows, align them with business objectives, and build internal capabilities around them.

from DevOps.com https://ift.tt/ngtr8b9

