Not Nick Jordan

Where AI slop meets a dumpster fire.

Super Connectors Are Ruining Your Customer Graph

Somewhere in your transaction data, there’s an email address that belongs to 80,000 customers. Not because 80,000 people share a Gmail. Because someone at a checkout terminal typed cashier@store.com when a customer didn’t have an account, and then every cashier at every location did the same thing, for years.

If you’re building an identity graph — linking customers across data sources by their shared PII tokens — this is the node that will quietly collapse your entire dataset into a single, useless blob.

What an Identity Graph Is Actually Doing

Identity resolution is a graph problem. Customers are nodes. Shared PII tokens — email addresses, phone numbers, payment tokens — are edges. Two records that share an email address get an edge between them. Run a connected components algorithm on the resulting graph, and every cluster of connected records becomes a single customer identity.

This works extraordinarily well for the common case: a customer who shops online with one email and in-store with a different card, but uses the same phone number for both. The graph connects those records, and you know they’re the same person.
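The mechanic is simple enough to sketch with a toy union-find over (record, token) pairs. All the names and data below are invented for illustration; production systems run this at a very different scale, but the shape of the logic is the same:

```python
from collections import defaultdict

# Toy (record, token) pairs: records and PII tokens are both graph nodes,
# and each pair is an edge between a record and a token it contains.
pairs = [
    ("web_1", "alice@example.com"),
    ("store_7", "555-0100"),
    ("web_1", "555-0100"),   # the shared phone links web and in-store records
    ("web_2", "bob@example.com"),
]

parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving keeps trees shallow
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

# Union each record with each of its tokens; records sharing a token
# become transitively connected through it.
for record, token in pairs:
    union(record, token)

# Group records by the root of their component.
components = defaultdict(set)
for record, _ in pairs:
    components[find(record)].add(record)

print(sorted(sorted(c) for c in components.values()))
# [['store_7', 'web_1'], ['web_2']]
```

The phone number stitches the web and in-store records into one identity; the unrelated record stays in its own component.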

The problem is what happens at the extreme.

The Super Connector Problem

A “super connector” in a graph is a node with anomalously high degree — more edges than any plausible real-world explanation can justify. In customer identity graphs, super connectors are almost always shared PII tokens: the family plan phone number that shows up on five different accounts, the company credit card used by a whole procurement team, the prepaid card that gets loaded and reloaded by different buyers, or the placeholder email that got entered at point of sale a hundred thousand times.

None of these are errors in the data. The records are accurate. The problem is what happens when you include these nodes in your connected components run.

A single shared phone number with 50,000 customer records on it becomes a 50,000-node component. Every one of those customers — who may have nothing else in common — gets assigned the same identity. Your graph has just told you that 50,000 different people are the same person.

And because graph connectivity is transitive, it gets worse. If even one of those 50,000 customers also shares a different token with a separate cluster of 20,000 people, those 20,000 people are pulled into the same component too. One sufficiently connected super node can corrupt enormous swaths of an otherwise clean graph.
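You can watch the collapse happen with a toy example (invented record names, stdlib only): two clean two-record clusters, plus a placeholder email that touches one record in each.

```python
from collections import defaultdict, deque

# Two clean clusters (a@x.com and b@y.com), plus a placeholder email
# typed at the point of sale that touches one record in each cluster.
record_tokens = {
    "r1": {"a@x.com"},
    "r2": {"a@x.com", "pos@store.com"},
    "r3": {"b@y.com"},
    "r4": {"b@y.com", "pos@store.com"},
}

def components(record_tokens):
    # Invert to token -> records, then BFS over the bipartite adjacency.
    by_token = defaultdict(set)
    for rec, toks in record_tokens.items():
        for t in toks:
            by_token[t].add(rec)
    seen, comps = set(), []
    for start in record_tokens:
        if start in seen:
            continue
        comp, queue = set(), deque([start])
        while queue:
            rec = queue.popleft()
            if rec in seen:
                continue
            seen.add(rec)
            comp.add(rec)
            for t in record_tokens[rec]:
                queue.extend(by_token[t] - seen)
        comps.append(comp)
    return comps

print([sorted(c) for c in components(record_tokens)])
# [['r1', 'r2', 'r3', 'r4']]
```

Without pos@store.com there would be two components; with it, everything fuses into one.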

The Cutoff Decision

The solution isn’t to ignore high-degree nodes — it’s to make a principled decision about which ones represent genuine signal versus structural noise.

Running identity resolution across a dataset with 1.2 billion edges, we’ve landed on a cutoff of 10,000: any PII token connected to more than 10,000 distinct source records gets excluded from the edge set before running connected components.

The threshold isn’t magic. It’s calibrated to the realistic upper bound of “genuinely shared” tokens. A family plan might have 6–8 accounts. A small business credit card might touch 20–30 employees. Even a large corporate card program with loose controls probably tops out at a few thousand. At 10,000, you’re well past any scenario where shared PII reflects a real relationship between the underlying people.
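The pre-filter step reduces to counting distinct records per token and dropping edges above the cutoff before the components run. Here is a hypothetical sketch of that shape (function and variable names are mine, not the production pipeline, which operates at billions of edges):

```python
from collections import defaultdict

SUPER_CONNECTOR_CUTOFF = 10_000  # the calibrated upper bound from the post

def filter_edges(edges, cutoff=SUPER_CONNECTOR_CUTOFF):
    """Drop every (record, token) edge whose token touches more than
    `cutoff` distinct records; return the kept edges plus an audit map
    of eliminated tokens and their degrees."""
    records_per_token = defaultdict(set)
    for record, token in edges:
        records_per_token[token].add(record)
    kept, eliminated = [], {}
    for record, token in edges:
        degree = len(records_per_token[token])
        if degree > cutoff:
            eliminated[token] = degree  # feed the audit view, don't discard silently
        else:
            kept.append((record, token))
    return kept, eliminated

# Tiny demo with a deliberately low cutoff: five records on one
# placeholder email, one record with a genuine email.
edges = [(f"r{i}", "pos@store.com") for i in range(5)] + [("r0", "a@x.com")]
kept, eliminated = filter_edges(edges, cutoff=2)
print(kept, eliminated)
# [('r0', 'a@x.com')] {'pos@store.com': 5}
```

Note that the eliminated tokens are returned, not dropped on the floor; they become input to the tracking views described below.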

Above that threshold, the node is almost certainly either infrastructure noise — placeholder emails, shared business accounts — or a fraud vector: a stolen card or synthetic identity that’s been used so many times it’s linking together records that have nothing in common except being victimized by the same bad actor.

What the Filtered Nodes Tell You

Here’s what’s counterintuitive: the filtered super connectors aren’t garbage. They’re one of the most interesting subsets of your data.

We built two materialized views to track them separately. eliminated_super_connectors captures the nodes we removed: for each, how many distinct source records it touches, and — critically — how many connected components those records would have been merged into had we kept the node. That second number is the damage estimate. A PII token touching 100,000 records that would have collapsed 500 independent components into one giant blob is a very different kind of problem than one that only appears in a single isolated cluster.
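The damage estimate itself boils down to: run connected components without the token, then count how many distinct components its records land in. A hypothetical Python sketch of that computation, with invented names (the actual views live in the warehouse and this is not their definition):

```python
def damage_estimate(token_records, record_component):
    """For an eliminated token, count how many distinct post-filter
    components its records span: the number of independent clusters
    the token would have merged into one, had it been kept."""
    comps = {record_component[r] for r in token_records if r in record_component}
    return len(comps)

# Hypothetical example: a placeholder email touched four records that,
# after filtering, resolved into three independent components.
record_component = {"r1": "c1", "r2": "c1", "r3": "c2", "r4": "c3"}
print(damage_estimate({"r1", "r2", "r3", "r4"}, record_component))
# 3
```

A damage estimate of 1 means the token lived inside a single cluster anyway; a damage estimate in the hundreds means it was a blob-maker.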

The super_connectors view captures high-degree nodes we kept — ones whose degree is elevated but which connect records across multiple distinct components in ways that seem worth preserving. These are the borderline cases: nodes that are probably shared, but where the pattern suggests a real relationship rather than infrastructure noise.

The key thing to resist: reading the eliminated super connectors as “things that were wrong” — data quality issues to be corrected. They’re not. The shared tokens are probably accurate. Including them still destroys the graph. We’re not saying the data is bad. We’re saying that connecting records through these nodes produces a worse model of reality than not connecting them.

Convergence and the Nodes You Missed

There’s a third category that doesn’t show up cleanly in either view: records that are still too connected even after super-connector filtering, and whose components never fully converged before the algorithm ran out of iterations.

Graph algorithms that run iteratively, as many large-scale connected components implementations do, can fail to fully propagate labels through extremely dense subgraphs within the allowed number of passes. What you're left with is a set of records that should logically be in the same component but aren't: either because label propagation ran out of iterations before reaching them, or because the only path between them ran through a node that was filtered out before the algorithm ever saw it.
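A toy illustration of the iteration-limit failure, assuming a simple min-label propagation scheme (invented for this post, not the production implementation): on a short chain, a capped pass count leaves one true component split across several labels.

```python
def label_propagation(edges, max_iters):
    """Min-label propagation: every node repeatedly adopts the smallest
    label among itself and its neighbours. Large-scale implementations
    bound the number of passes; too few passes leaves labels stranded."""
    nodes = {n for edge in edges for n in edge}
    labels = {n: n for n in nodes}
    for _ in range(max_iters):
        changed = False
        for a, b in edges:
            lo = min(labels[a], labels[b])
            if labels[a] != lo or labels[b] != lo:
                labels[a] = labels[b] = lo
                changed = True
        if not changed:
            break
    return labels

# A five-node chain, with edges ordered against the direction the
# smallest label needs to travel, so each pass advances it one step.
chain = [(4, 5), (3, 4), (2, 3), (1, 2)]
capped = label_propagation(chain, max_iters=2)
full = label_propagation(chain, max_iters=10)
print(len(set(capped.values())), len(set(full.values())))
# 3 1
```

With two passes the chain still carries three distinct labels for what is logically one component; given enough iterations it converges to one.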

This is why you want visibility into both the nodes you removed and the nodes that remain high-degree after filtering. The first tells you what you prevented. The second tells you what you might have missed.

Why This Generalizes

The super connector problem isn’t specific to identity resolution. It shows up anywhere you’re building graphs from real-world co-occurrence data.

Recommendation systems where a blockbuster product appears in so many purchase baskets that it creates spurious similarity between unrelated users. Citation networks where review articles are cited by so many papers that they become meaningless connectors between unrelated research areas. Social graphs where verified accounts have so many followers that their neighborhoods lose analytical meaning.

The general principle: in any graph built from observed co-occurrence, nodes with degree far above the mean are structural features of your data collection process, not signal about the underlying phenomenon. The right response is to filter them, track them separately, and study what the filtering reveals.

The cutoff is always a judgment call. The discipline is in making it explicitly, with a principled rationale, and with enough instrumentation to know what you’re giving up when you do.