The Brief
A monitoring estate
nobody fully
understood.
TD Bank's monitoring organization came to us with a clear problem and a complex landscape. Across four infrastructure layers, L1 through L4, two physical data centers in Scarborough and Barrie, and multiple lines of business, dozens of monitoring tools were running in parallel. Many were performing overlapping or redundant functions. Few were connected to each other. The same underlying incident could trigger independent alerts simultaneously across multiple layers and teams, none of whom had visibility into what the others were seeing.
The result was a NOC and operations function buried in alert volume, spending significant time triaging noise rather than responding to real incidents. In a bank processing billions in daily transactions, delayed incident response carries direct financial and compliance consequences. The brief was to architect an AI-powered pipeline that could ingest alerts from across this fragmented estate, correlate them intelligently, and surface a single actionable incident story to operators, instead of dozens of disconnected alerts.
The Discovery
Mapping what
nobody had
mapped before.
Before IBM AIOps could correlate a single alert, we needed a clean, accurate map of every Configuration Item across the estate: who owned it, what was monitoring it, and how it related to everything else. That map didn't exist. It lived partially in ServiceNow's CMDB, partially in the heads of SMEs across multiple teams, and partially nowhere at all.
We structured discovery across all four layers. L1 was relatively cooperative; a small group of IBM Netcool specialists could walk us through the foundational alerting architecture. L2 and L3 became progressively more complex, requiring us to work with TD leadership to identify and engage three to four SMEs per layer across the primary lines of business. Getting the right people to the table required both executive support and a deliberate effort to create psychological safety in the sessions; these were conversations that surfaced gaps, redundancies, and ownership ambiguities that teams hadn't previously been asked to expose.
The CMDB work was particularly intensive. We downloaded CI ownership lists from ServiceNow Discovery, cleansed the data, and went back to LOB SMEs to validate and correct it. There were instances where the CMDB was simply wrong: ownership had changed, tools had been decommissioned, or new infrastructure had been added without corresponding CMDB updates. Every discrepancy had to be resolved before it could feed the topology model.
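The reconciliation step can be sketched in a few lines. This is a hypothetical illustration, not the actual tooling used in the engagement: the field names (`ci_id`, `owner`) stand in for ServiceNow's real schema, and the SME answers are modeled as a simple validated list.

```python
def find_discrepancies(cmdb_rows, validated_rows):
    """Return (ci_id, reason) pairs wherever the CMDB export disagrees
    with what SMEs confirmed in validation sessions."""
    confirmed = {row["ci_id"]: row for row in validated_rows}
    issues = []
    for row in cmdb_rows:
        truth = confirmed.get(row["ci_id"])
        if truth is None:
            # CI exists in the export but no SME has vouched for it
            issues.append((row["ci_id"], "no SME validation on record"))
        elif row["owner"] != truth["owner"]:
            # Ownership drifted since the CMDB was last updated
            issues.append((row["ci_id"],
                           f"CMDB owner '{row['owner']}' vs validated '{truth['owner']}'"))
    return issues
```

Every pair this surfaces is a conversation to have with an LOB team before the record can feed the topology model.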
"Getting the right people in the room is harder than building the pipeline. The AI is only as good as the topology feeding it."
We also encountered something that made the problem significantly harder: duplicate monitoring within single lines of business. The same CI was being monitored simultaneously by Datadog and Dynatrace, with both feeding into Splunk and reporting through Grafana. This wasn't negligence; it was the result of different teams acquiring different tools independently over time, with no central governance to rationalize the estate. Before AIOps could deduplicate alerts across teams, we first had to eliminate redundant monitoring within them.
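The redundancy audit itself is conceptually simple once the inventories exist. A minimal sketch, assuming each monitoring platform can export `(ci_id, tool)` records (the names below are illustrative):

```python
from collections import defaultdict

def redundant_monitoring(records):
    """Given (ci_id, tool) records from each platform's inventory,
    return only the CIs being watched by more than one tool."""
    tools_by_ci = defaultdict(set)
    for ci_id, tool in records:
        tools_by_ci[ci_id].add(tool)
    # Keep just the CIs with overlapping coverage
    return {ci: sorted(tools) for ci, tools in tools_by_ci.items()
            if len(tools) > 1}
```

The hard part was never the code; it was assembling trustworthy inventories from tools that had been acquired independently, with no shared naming convention for the same CI.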
The Architecture
From fragmented
noise to a single
alert story.
The IBM AIOps pipeline operates in several stages. Alerts from all monitoring tools are ingested and normalized into a common schema, so the system can reason across sources that previously spoke different languages. The cleaned CI topology from ServiceNow feeds a dynamic infrastructure graph, allowing the AI to understand relationships between components; a CPU spike on a node, a pod restart in Kubernetes, and a service latency alert in APM are no longer three separate incidents but three symptoms of one.
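To make the normalization stage concrete, here is a hedged sketch of what mapping source-specific payloads into one common schema looks like. The per-source field names are illustrative stand-ins, not the tools' real APIs or the schema used in the engagement:

```python
COMMON_FIELDS = ("ci", "metric", "severity", "ts")

def normalize(source, raw):
    """Translate a source-specific alert payload into the common schema
    the correlation stages reason over."""
    if source == "datadog":
        return {"ci": raw["host"], "metric": raw["metric"],
                "severity": raw["priority"], "ts": raw["date"]}
    if source == "dynatrace":
        return {"ci": raw["entityName"], "metric": raw["problemTitle"],
                "severity": raw["severityLevel"], "ts": raw["startTime"]}
    raise ValueError(f"no normalizer registered for {source}")
```

Once every alert speaks this shared language, deduplication and topology correlation can treat a Datadog host alert and a Dynatrace entity problem as comparable events on the same CI.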
Deduplication eliminates repeated alerts for the same metric on the same CI within a time window, typically cutting alert traffic by 30 to 40 percent on its own. AI-based temporal correlation then groups the remaining alerts by causal pattern, learning over time which sequences of events belong to the same underlying incident. The output to operators is a single correlated alert story: one incident, one probable root cause, all contributing symptoms grouped. The pipeline integrates end-to-end with JIRA for ticketing and xMatters for escalation and notification.
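The time-window deduplication described above can be sketched as follows. This is a simplified illustration of the technique, not IBM AIOps internals; the five-minute window is an assumed value, and real systems tune it per metric and severity:

```python
def dedupe(alerts, window_s=300):
    """Keep one alert per (ci, metric) per window_s seconds;
    drop the repeats inside the window."""
    last_kept = {}
    kept = []
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        key = (alert["ci"], alert["metric"])
        # Emit only if we haven't kept this key recently
        if key not in last_kept or alert["ts"] - last_kept[key] >= window_s:
            kept.append(alert)
            last_kept[key] = alert["ts"]
    return kept
```

Even this naive version shows why the stage pays for itself: a CPU alert re-firing every sixty seconds collapses to one event per window before the AI correlation ever runs.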
The Outcome
Measurable reduction.
Lasting impact.
The pipeline went live and began actively deduplicating and correlating alerts across the estate. Alert reduction was evident at each layer as the pipeline came online; the combination of deduplication, topology correlation, and AI pattern detection is projected to reduce duplicate and false-positive alerts by 80 percent, benchmarked against industry standards and validated through pilot data. For a bank of TD's scale, that reduction translates directly into faster incident response, lower operational overhead, and reduced compliance risk.
The consolidation extended beyond the pipeline itself. Manual leadership reports that had previously required human compilation across multiple monitoring tools were replaced by active dashboards in Splunk, giving leadership real-time visibility into the health of the monitoring estate without the latency of periodic reporting.
"In 2025, the bank significantly updated its compliance monitoring by integrating AI-driven tools to enhance the detection of high-risk activity."
What began as an alert noise reduction initiative has since expanded into a broader AI-driven compliance monitoring capability at TD. The architectural foundation laid during this engagement (clean CI topology, normalized multi-source ingestion, and AI-based correlation) became the infrastructure on which TD's next generation of monitoring was built.
"What I'd do differently."
If I were starting this engagement today, I would begin with a larger team and push harder, earlier, to get the right stakeholders to the table. The most expensive moments in this engagement were the last-minute changes driven by late-arriving SMEs: people who had critical knowledge about specific tooling or CI ownership but weren't identified or made available until discovery was already well underway. In an environment this complex, stakeholder mapping is as important as technical architecture. The pipeline you design is only as accurate as the people who informed it, and every gap in that group becomes a gap in the topology, which becomes noise in the output. Getting the right people in the room early isn't a logistics problem; it's a product quality problem.