Case Study · 2023–Present · Enterprise AI · TD Bank

Turning Alert Chaos into Signal

Company — TD Bank
Role — Director-Level · Senior Lead PM
Domain — AIOps · Enterprise Monitoring · AI Pipeline

A bank processing billions in daily transactions was drowning in alert noise across four monitoring layers, two data centers, and multiple lines of business. No single team had visibility across the full estate. The brief was to fix that, using AI.

Outcome
80% Projected alert noise reduction · 5× industry benchmark
4 Monitoring layers unified across multiple LOBs
Live Pipeline active and deduplicating alerts in production

A monitoring estate nobody fully understood.

TD Bank's monitoring organization came to us with a clear problem and a complex landscape. Across four infrastructure layers (L1 through L4), two physical data centers in Scarborough and Barrie, and multiple lines of business, dozens of monitoring tools were running in parallel. Many were performing overlapping or redundant functions. Few were connected to each other. The same underlying incident could trigger independent alerts simultaneously across multiple layers and teams, none of whom had visibility into what the others were seeing.

The result was a NOC and operations function buried in alert volume, spending significant time triaging noise rather than responding to real incidents. In a bank processing billions in daily transactions, delayed incident response carries direct financial and compliance consequences. The brief was to architect an AI-powered pipeline that could ingest alerts from across this fragmented estate, correlate them intelligently, and surface a single actionable incident story to operators, instead of dozens of disconnected alerts.

Monitoring Tools in Scope
IBM Netcool · Datadog · Dynatrace · Splunk · Grafana · SolarWinds · PagerDuty · xMatters · F5 · Ansible · CyberArk · Tufin SecureTrack · Hitachi Suite · Dell Suite · + Additional tools

Mapping what nobody had mapped before.

Before IBM AIOps could correlate a single alert, we needed a clean, accurate map of every Configuration Item (CI) across the estate: who owned it, what was monitoring it, and how it related to everything else. That map didn't exist. It lived partially in ServiceNow's CMDB, partially in the heads of SMEs across multiple teams, and partially nowhere at all.

We structured discovery across all four layers. L1 was relatively cooperative; a small group of IBM Netcool specialists could walk us through the foundational alerting architecture. L2 and L3 became progressively more complex, requiring us to work with TD leadership to identify and engage three to four SMEs per layer across the primary lines of business. Getting the right people to the table required both executive support and a deliberate effort to create psychological safety in the sessions; these were conversations that surfaced gaps, redundancies, and ownership ambiguities that teams hadn't previously been asked to expose.

The CMDB work was particularly intensive. We downloaded CI ownership lists from ServiceNow Discovery, cleansed the data, and went back to LOB SMEs to validate and correct it. There were instances where the CMDB was simply wrong; ownership had changed, tools had been decommissioned, or new infrastructure had been added without corresponding CMDB updates. Every discrepancy had to be resolved before it could feed the topology model.
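As a minimal sketch of what that reconciliation step can look like: the code below compares a CMDB export against an SME-validated ownership list and flags the three discrepancy types described above. All file layouts, column names, and the categories themselves are illustrative assumptions, not the actual TD or ServiceNow schema.

```python
import csv
from collections import defaultdict

def load_ci_records(path):
    """Load a CI export as a list of dicts (hypothetical CSV layout)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def find_discrepancies(cmdb_rows, sme_rows):
    """Compare CMDB ownership against SME-validated ownership.

    Flags: CIs whose recorded owner disagrees with the SMEs, CIs the SMEs
    never confirmed (possibly decommissioned), and CIs the SMEs know about
    that the CMDB is missing (infrastructure added without an update).
    Assumes both sources share hypothetical 'ci_id' and 'owner' columns.
    """
    cmdb = {r["ci_id"]: r for r in cmdb_rows}
    sme = {r["ci_id"]: r for r in sme_rows}
    issues = defaultdict(list)
    for ci_id, row in cmdb.items():
        if ci_id not in sme:
            issues["unvalidated_or_decommissioned"].append(ci_id)
        elif row["owner"].strip().lower() != sme[ci_id]["owner"].strip().lower():
            issues["ownership_mismatch"].append(ci_id)
    for ci_id in sme:
        if ci_id not in cmdb:
            issues["missing_from_cmdb"].append(ci_id)
    return issues
```

Every CI landing in one of these buckets went back to the LOB SMEs for resolution before it was allowed to feed the topology model.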

"Getting the right people in the room is harder than building the pipeline. The AI is only as good as the topology feeding it."

We also encountered something that made the problem significantly harder: duplicate monitoring within single lines of business. The same CI was being monitored simultaneously by Datadog and Dynatrace, with both feeding into Splunk and reporting through Grafana. This wasn't negligence; it was the result of different teams acquiring different tools independently over time, with no central governance to rationalize the estate. Before AIOps could deduplicate alerts across teams, we first had to eliminate redundant monitoring within them.
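Detecting that kind of overlap is mechanically simple once coverage is inventoried. A sketch, under the assumption that monitoring coverage has already been flattened into (LOB, CI, tool) tuples; the data shape is illustrative, not the engagement's actual format:

```python
from collections import defaultdict

def find_duplicate_coverage(coverage):
    """Return CIs monitored by more than one tool within the same LOB.

    `coverage` is an iterable of (lob, ci_id, tool) tuples (assumed shape).
    """
    tools_by_ci = defaultdict(set)
    for lob, ci_id, tool in coverage:
        tools_by_ci[(lob, ci_id)].add(tool)
    return {key: sorted(tools)
            for key, tools in tools_by_ci.items() if len(tools) > 1}

# Example: the Datadog/Dynatrace overlap described above would surface as
# find_duplicate_coverage([("retail", "host-01", "Datadog"),
#                          ("retail", "host-01", "Dynatrace")])
# -> {("retail", "host-01"): ["Datadog", "Dynatrace"]}
```

The hard part was never the detection; it was negotiating which tool stayed, which is what Decision 02 below addresses.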


Three decisions that shaped the outcome.

Decision 01
Starting at L1 and Building Outward
With four layers and multiple LOBs in scope, there was a deliberate choice about where to begin. We started at L1, the most cooperative layer with the clearest ownership and the most mature tooling. This wasn't the path of least resistance for its own sake; it was a sequencing strategy. Establishing a clean, validated topology at L1 gave the AI pipeline a reliable foundation to build on, and it gave us credibility with the harder layers above. By the time we were engaging L2 and L3 SMEs, we had a working model to show rather than a theoretical proposal to sell.
Outcome → Built credibility and pipeline foundation before tackling organizational complexity
Decision 02
Leading With Data to Drive Tool Decommissioning
Getting LOB teams to turn off monitoring tools they had chosen and trusted was one of the most politically sensitive challenges of the engagement. These teams had deployed duplicate tooling for a reason; they didn't trust other layers to catch what they needed caught. Rather than leading with authority or top-down mandates, we led with data. The financial case was clear: redundant licensing costs, redundant maintenance overhead, and measurable alert noise directly attributable to tool duplication. Where teams resisted at the SME level, the cost savings narrative gave leadership the justification to move the decision upward. Executive cover was the backstop, not the opening move.
Outcome → Decommissioned redundant tools without triggering organizational resistance
Decision 03
Flagging End-of-Life Tools and Recommending Replacements
During the CI mapping and discovery process, we identified several tools across the estate that were at or approaching end-of-life, carrying security risk, lacking vendor patch support, or both. Rather than treating these as out of scope, we documented them and brought forward replacement recommendations as part of the broader engagement. This expanded the brief beyond alert noise reduction into a monitoring estate health assessment. A pipeline built on top of EOL tooling would have introduced new risk at the same time it was reducing old risk; the right call was to surface it.
Outcome → Delivered a monitoring estate health assessment alongside the AIOps pipeline

From fragmented noise to a single alert story.

The IBM AIOps pipeline operates in several stages. Alerts from all monitoring tools are ingested and normalized into a common schema, so the system can reason across sources that previously spoke different languages. The cleaned CI topology from ServiceNow feeds a dynamic infrastructure graph, allowing the AI to understand relationships between components; a CPU spike on a node, a pod restart in Kubernetes, and a service latency alert in APM are no longer three separate incidents but three symptoms of one.
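To illustrate the normalization stage, here is a minimal common-schema sketch in Python. The NormalizedAlert fields and the incoming payload keys are assumptions for illustration, not IBM AIOps internals or the real Datadog webhook format:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class NormalizedAlert:
    source: str         # originating monitoring tool
    ci_id: str          # Configuration Item the alert refers to
    metric: str         # e.g. "cpu.utilization"
    severity: int       # normalized: 1 (critical) .. 5 (info)
    timestamp: datetime

def normalize_datadog(raw: dict) -> NormalizedAlert:
    """Map one hypothetical Datadog-style payload onto the common schema.

    Field names ('host', 'metric_name', 'priority', 'epoch_seconds') are
    illustrative, not the actual Datadog payload.
    """
    return NormalizedAlert(
        source="datadog",
        ci_id=raw["host"],
        metric=raw["metric_name"],
        severity={"P1": 1, "P2": 2, "P3": 3}.get(raw["priority"], 4),
        timestamp=datetime.fromtimestamp(raw["epoch_seconds"], tz=timezone.utc),
    )
```

One adapter per source tool is the pattern; once every alert lands in the same shape, the downstream stages never need to know which tool produced it.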

Deduplication eliminates repeated alerts for the same metric on the same CI within a time window, typically cutting alert traffic by 30 to 40 percent on its own. AI-based temporal correlation then groups the remaining alerts by causal pattern, learning over time which sequences of events belong to the same underlying incident. The output to operators is a single correlated alert story: one incident, one probable root cause, all contributing symptoms grouped. The pipeline integrates end-to-end with JIRA for ticketing and xMatters for escalation and notification.
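The windowed deduplication stage can be sketched in a few lines. This reuses the hypothetical NormalizedAlert schema from the previous sketch, assumes alerts arrive sorted by timestamp, and uses an illustrative five-minute window rather than any production tuning:

```python
from datetime import timedelta

def deduplicate(alerts, window=timedelta(minutes=5)):
    """Drop repeats of the same metric on the same CI that arrive within
    `window` of the last emitted alert for that (ci_id, metric) pair.

    Assumes `alerts` is an iterable of NormalizedAlert instances sorted
    by timestamp.
    """
    last_emitted = {}
    kept = []
    for alert in alerts:
        key = (alert.ci_id, alert.metric)
        previous = last_emitted.get(key)
        if previous is None or alert.timestamp - previous >= window:
            kept.append(alert)                  # first occurrence, or window expired
            last_emitted[key] = alert.timestamp
    return kept
```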

Alert Pipeline — End to End
Multi-Source Ingestion → Normalization → CI Topology Mapping → Deduplication → AI Correlation → Root Cause Analysis → Single Alert Story → JIRA · xMatters

Measurable reduction. Lasting impact.

The pipeline went live and began actively deduplicating and correlating alerts across the estate. Alert reduction was evident at each layer as the pipeline came online; the combination of deduplication, topology correlation, and AI pattern detection is projected to reduce duplicate and false-positive alerts by 80 percent, benchmarked against industry standards and validated through pilot data. For a bank of TD's scale, that reduction translates directly into faster incident response, lower operational overhead, and reduced compliance risk.

The consolidation extended beyond the pipeline itself. Manual leadership reports that had previously required human compilation across multiple monitoring tools were replaced by active dashboards in Splunk, giving leadership real-time visibility into the health of the monitoring estate without the latency of periodic reporting.

80% Projected alert noise reduction · 5× industry benchmark
Live Pipeline active in production across TD's monitoring estate
4 Layers L1 through L4 unified across multiple lines of business
Legacy · 2025

"In 2025, the bank significantly updated its compliance monitoring by integrating AI-driven tools to enhance the detection of high-risk activity."

Public reporting on TD Bank's AI monitoring capabilities · 2025

What began as an alert noise reduction initiative has since expanded into a broader AI-driven compliance monitoring capability at TD. The architectural foundation laid during this engagement (clean CI topology, normalized multi-source ingestion, and AI-based correlation) became the infrastructure on which TD's next generation of monitoring was built.


Reflection

"What I'd do differently."

If I were starting this engagement today, I would begin with a larger team and push harder, earlier, to get the right stakeholders to the table. The most expensive moments in this engagement were the last-minute changes driven by late-arriving SMEs: people who had critical knowledge about specific tooling or CI ownership but weren't identified or made available until discovery was already well underway. In an environment this complex, stakeholder mapping is as important as technical architecture. The pipeline you design is only as accurate as the people who informed it, and every gap in that group becomes a gap in the topology, which becomes noise in the output. Getting the right people in the room early isn't a logistics problem; it's a product quality problem.
