Not Nick Jordan

Where AI slop meets a dumpster fire.

Lookalike Modeling Is Just Conditional Probability. SQL Can Do Conditional Probability.

The pitch for “lookalike modeling” usually comes wrapped in ML scaffolding — Python notebooks, feature pipelines, model training jobs, inference endpoints. All of that machinery implies something sophisticated is happening. Sometimes it is. Often, it isn’t. And when it isn’t, you’re paying for complexity you don’t need.

Here’s what lookalike modeling actually is: given a seed audience (people you know match some criteria), find the people in your broader population who most resemble them. That’s it. The question is how you define “resemble” and how you compute it at scale.

For categorical and continuous attributes stored in a SQL database, you can build a statistically principled lookalike scorer without leaving SQL. The algorithm is Naive Bayes. The implementation fits in a single query. The performance tuning is interesting.

The Statistical Intuition

Naive Bayes asks: given that someone is in the seed audience, how much more likely are they to have attribute X compared to the general population? If 40% of your seed audience bought shoes but only 10% of the general population did, shoes is a strong positive signal. Encode that as a log-likelihood ratio and you can score any candidate by summing their log-weights across attributes.

The formula for a categorical attribute:

w(value) = ln( P(value | seed) / P(value | population) )

With Laplace smoothing to handle values that appear in one group but not the other:

w(value) = ln(
  ( (seed_count(value) + 1) / (seed_total + cardinality) )
  /
  ( (pop_count(value) + 1) / (pop_total + cardinality) )
)

A positive weight means this value is over-represented in the seed. A negative weight means it’s under-represented. A candidate’s total score is the sum of all their weights — the higher the score, the more they “look like” the seed.
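
Before wiring this into SQL, the weight formula is easy to sanity-check in a few lines of Python. A sketch only — the counts below are invented for illustration:

```python
from math import log

def laplace_log_weight(seed_count, seed_total, pop_count, pop_total, cardinality):
    """Laplace-smoothed log-likelihood ratio for one attribute value."""
    p_seed = (seed_count + 1) / (seed_total + cardinality)
    p_pop = (pop_count + 1) / (pop_total + cardinality)
    return log(p_seed / p_pop)

# The shoes example from above: 40% of a 1,000-person seed vs. 10% of a
# 100,000-person population, with 5 distinct category values.
w_shoes = laplace_log_weight(400, 1000, 10000, 100000, 5)  # positive: over-represented

# A value seen in the seed but never in the population stays finite
# thanks to the +1 smoothing, instead of blowing up to infinity.
w_rare = laplace_log_weight(3, 1000, 0, 100000, 5)
```

Without the smoothing, any value absent from one group would produce a division by zero or a log of zero; the +1 in each numerator and the cardinality in each denominator keep every weight finite.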

Building It in SQL

The query structure follows the math directly. First, compute per-attribute weights for the seed versus population:

WITH seed_count AS (
  -- Group totals, referenced below as sc.total and pc.total
  SELECT APPROX_COUNT_DISTINCT(id)::float AS total
  FROM labeled_ids WHERE is_seed = 1
),
pop_count AS (
  SELECT APPROX_COUNT_DISTINCT(id)::float AS total
  FROM labeled_ids WHERE is_seed = 0
),
seed_att1 AS (
  SELECT att_1, APPROX_COUNT_DISTINCT(id)::float AS cnt
  FROM labeled_ids WHERE is_seed = 1 GROUP BY att_1
),
pop_att1 AS (
  SELECT att_1, APPROX_COUNT_DISTINCT(id)::float AS cnt
  FROM labeled_ids WHERE is_seed = 0 GROUP BY att_1
),
att1_card AS (
  -- Distinct-value count of att_1: the cardinality term in the smoothing
  SELECT COUNT(*)::float AS k
  FROM (SELECT att_1 FROM labeled_ids GROUP BY att_1) t
),
att1_w AS (
  SELECT
    COALESCE(sa.att_1, pa.att_1) AS att_1,
    LN(
      ((COALESCE(sa.cnt, 0) + 1.0) / (sc.total + ac.k))
      /
      ((COALESCE(pa.cnt, 0) + 1.0) / (pc.total + ac.k))
    ) AS w
  FROM seed_att1 sa
  FULL OUTER JOIN pop_att1 pa ON sa.att_1 = pa.att_1
  CROSS JOIN seed_count sc CROSS JOIN pop_count pc CROSS JOIN att1_card ac
)

Then score candidates by joining against the weight tables and summing. Phase 1 is a pre-filter: keep only candidates with at least one positive-weight attribute. That dramatically shrinks the candidate set before the final scoring pass, which is the expensive part.
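
The two-phase shape is easier to see stripped of SQL. A minimal Python sketch with illustrative weights and candidate data (none of these names come from the query):

```python
def positive_values(weights):
    """Attribute values whose log-weight is positive (over-represented in seed)."""
    return {val for val, w in weights.items() if w > 0}

def score_candidates(candidates, weights):
    # Phase 1: cheap pre-filter -- drop anyone with no positive-weight value.
    keep = positive_values(weights)
    survivors = {cid: vals for cid, vals in candidates.items() if keep & set(vals)}
    # Phase 2: expensive pass -- sum log-weights only for survivors.
    return {cid: sum(weights.get(v, 0.0) for v in vals)
            for cid, vals in survivors.items()}

weights = {"shoes": 1.39, "pants": 0.22, "hats": -0.9}
candidates = {"a": ["shoes", "hats"], "b": ["hats"], "c": ["pants"]}
scores = score_candidates(candidates, weights)
# "b" has only negative-weight attributes and is filtered out in phase 1;
# "a" outscores "c" because shoes carries the largest positive weight.
```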

For continuous attributes (like age or purchase amount), you can compute a Gaussian probability: how far is this candidate’s value from the seed mean, in terms of standard deviations? Candidates close to the seed mean score higher.
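
One way to keep the continuous case consistent with the categorical weights is a log-ratio of two Gaussian densities, one fit on the seed and one on the population. A sketch with invented means and standard deviations:

```python
from math import log, sqrt, pi

def gaussian_log_pdf(x, mean, std):
    """Log of the normal density at x."""
    z = (x - mean) / std
    return -0.5 * z * z - log(std * sqrt(2.0 * pi))

def continuous_weight(x, seed_mean, seed_std, pop_mean, pop_std):
    """ln( N(x | seed) / N(x | population) ), the analogue of the categorical weight."""
    return (gaussian_log_pdf(x, seed_mean, seed_std)
            - gaussian_log_pdf(x, pop_mean, pop_std))

# Illustrative parameters: the seed skews young (mean 28, std 5) against a
# broader population (mean 40, std 12).
w_young = continuous_weight(29, 28.0, 5.0, 40.0, 12.0)  # near seed mean: positive
w_old = continuous_weight(55, 28.0, 5.0, 40.0, 12.0)    # far from seed mean: negative
```

The means and standard deviations are just AVG() and STDDEV() over the seed and population rows, so this slots into the same CTE pattern as the categorical weights.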

The Multi-Value Attribute Problem

The first version of this query used MODE() to canonicalize each person to their “most common” value for categorical attributes. That falls apart immediately in practice.

If someone bought shoes and pants, their product category attribute has two valid values. Picking one erases signal. MODE() is fine for truly single-valued attributes (gender, country, cohort), but for behavioral attributes — purchase history, interests, categories — you need to preserve the full distribution.

The fix is to drop the canonicalization step and switch to frequency-weighted scoring. Instead of “what is this person’s single category,” ask “across all this person’s rows, what fraction of their purchases were in each category?” Then compute a weighted sum:

SELECT
  c.id,
  SUM(c.freq_weight * w.w) AS score
FROM candidate_freq c
JOIN att1_w w ON w.att_1 = c.att_1
GROUP BY c.id

Where freq_weight is row_count / total_rows_for_this_id. A person who bought mostly shoes gets most of their score from the shoes weight, with partial credit for the pants weight. This is more faithful to the data and produces more accurate scores.
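
The same frequency-weighted sum, sketched in Python with illustrative weights:

```python
from collections import Counter

def freq_weighted_score(purchase_rows, weights):
    """Score one person: sum of (fraction of rows in each category) * log-weight."""
    counts = Counter(purchase_rows)
    total = sum(counts.values())
    return sum((n / total) * weights.get(cat, 0.0) for cat, n in counts.items())

weights = {"shoes": 1.39, "pants": 0.22}
# Three shoe purchases and one pants purchase: 75% of the score mass
# comes from the shoes weight, 25% from pants.
score = freq_weighted_score(["shoes", "shoes", "shoes", "pants"], weights)
```

This is exactly the SUM(freq_weight * w) above, just computed per-person in memory instead of per-row in the warehouse.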

Performance at Scale

A million-user population with five attributes generates a lot of CTEs. The performance problems are predictable:

COUNT(DISTINCT) is slow. Every distinct count requires sorting or hashing the full column. Replace with APPROX_COUNT_DISTINCT() where exactness isn’t required (it almost never is for lookalike scoring — you don’t need to know the seed has exactly 1,000 members versus 1,003).

ORDER BY ... LIMIT n is slow when the input is large. The operation can't return its first row until every candidate has been scored and ranked. If you need the top 10,000 lookalikes out of a million candidates, you're materializing all million scores first. Pre-filtering with the phase 1 step (positive-weight attributes only) shrinks this significantly.

Multiple passes over the same table. The original query hit labeled_ids a dozen times — once per attribute, per direction (seed vs. population). Move shared computations into materialized views and treat them as checkpoints. Once canonical_ids and labeled_ids are materialized, downstream CTEs read from a snapshot instead of recomputing from source.

The materialized view structure also makes incremental updates tractable. A daily refresh can recompute only the weight tables (att1_w, att2_w) when the underlying data changes, without touching the canonicalization layer if the source data schema is stable.

What You Actually Built

At the end of this, you have a SQL pipeline that:

  1. Computes a seed population profile across categorical and continuous attributes
  2. Applies Laplace-smoothed log-likelihood weights to handle sparse values correctly
  3. Scores every non-seed candidate by how well their attribute distribution matches the seed
  4. Handles multi-value attributes without collapsing them to a single canonical value
  5. Uses materialized view checkpoints to avoid recomputing stable intermediate results
  6. Scales to millions of candidates without touching Python

The “ML stack” version of this — feature extraction, model training, inference, serving — adds real value when your features are high-dimensional, sparse, or require learned representations. If you’re working with a handful of behavioral and demographic attributes in a warehouse, that stack is overhead. The conditional probability is already in the data. SQL is capable of doing the math.

The question worth sitting with is how often we reach for complex tooling because the problem sounds like it requires it, not because it actually does. Lookalike modeling sounds like machine learning. Underneath, it’s a ratio. SQL can compute ratios.