A Cleaner Way to Categorize Probabilistic Models

The classic split between generative and discriminative models was introduced for supervised learning: compare Naïve Bayes, which models the joint $p(x, y)$, with logistic regression, which models the conditional $p(y\mid x)$.

In contemporary ML, the term “generative” is used loosely to refer to any model that is not discriminative, regardless of whether it actually specifies a probability distribution, a latent generative story, or even a sampling mechanism. The result is terminology that is historically overloaded, pedagogically misleading, and conceptually inaccurate. The field needs a cleaner taxonomy, one that matches what probabilistic models actually do rather than relying on outdated labels. Here we propose such a framework.

1. Primary axis: Distribution modeling vs. Conditional modeling

The first, clean question is:

What probability does the model aim to capture?

We propose to replace “generative vs. discriminative” with Distribution modeling and Conditional modeling.

Distribution Modeling

The model’s primary target is the data distribution $p(x)$. Examples include:

  • latent-variable generative models (GMM, HMM, ICA, VAE, RBM, diffusion-as-generative-chain)
  • normalizing flows
  • autoregressive models
  • energy-based models (EBMs)
  • score models
  • efficient-coding / equal-mass partition models
  • Bayesian networks modeling only $p(x)$
  • Markov chains defining a stationary distribution

Conditional Modeling

The model’s target is a conditional distribution $p(y\mid x)$. Examples:

  • logistic regression
  • CRFs / HCRFs
  • Bayesian logistic regression

This axis is minimal, clean, and consistent with basic probability theory:

  • “Distribution” = model $p(x)$
  • “Conditional” = model $p(y\mid x)$

No extra philosophical baggage.
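
To make the contrast concrete, here is a minimal sketch (assuming NumPy and scikit-learn; the toy data and model choices are mine, purely for illustration): a Gaussian mixture whose target is $p(x)$ next to a logistic regression whose target is $p(y\mid x)$.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, (500, 2)), rng.normal(2, 1, (500, 2))])
y = np.repeat([0, 1], 500)

# Distribution modeling: the target is p(x) itself.
dist_model = GaussianMixture(n_components=2).fit(x)
print("log p(x):", dist_model.score_samples(x[:1]))   # mixture log-density

# Conditional modeling: the target is p(y | x); p(x) is never modeled.
cond_model = LogisticRegression().fit(x, y)
print("p(y | x):", cond_model.predict_proba(x[:1]))
```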

2. Second axis: Latent vs. non-latent

Next, we ask:

Does the model explicitly introduce latent variables $z$ as part of its probabilistic story?

  • Latent = Yes: The model introduces hidden variables (sources, codes, states, classes, etc.) with a distribution $p(z)$ and some relationship to $x$.

  • Latent = No: The model operates directly on $x$; it may factorize $p(x)$, define a flow, an energy, a score, a Markov chain over $x$, or an equal-mass partition, all without any explicit latent $z$.

For distribution models:

  • Latent + distribution ⇒ latent-variable generative models in the classical statistical sense: they specify a joint $p(x,z) = p(z)p(x\mid z)$ and a sampling story.

  • Non-latent + distribution ⇒ non-latent distributional models: they define or approximate aspects of $p(x)$ without introducing explicit latents.

For conditional models:

  • Latent vs. non-latent simply tells you whether hidden variables enter the conditional, e.g., via a $p(y\mid x, z)$ term that is marginalized over $z$ (as in HCRFs).
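
A toy sketch of the distinction, with made-up parameters (NumPy assumed; the sampler names are mine): the first follows an explicit latent story $z \sim p(z)$, $x \sim p(x\mid z)$; the second factorizes $p(x)$ directly over observed dimensions, with no hidden variable anywhere.

```python
import numpy as np

rng = np.random.default_rng(0)

# Latent: an explicit probabilistic story p(x, z) = p(z) p(x | z).
def sample_latent():
    z = rng.choice(2, p=[0.3, 0.7])                # z ~ p(z)
    x = rng.normal(loc=(-2.0, 2.0)[z], scale=1.0)  # x ~ p(x | z)
    return x, z

# Non-latent: p(x1, x2) = p(x1) p(x2 | x1), defined over observed variables only.
def sample_non_latent():
    x1 = rng.normal(0.0, 1.0)                      # x1 ~ p(x1)
    x2 = rng.normal(0.5 * x1, 1.0)                 # x2 ~ p(x2 | x1)
    return x1, x2
```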

3. Third axis: Density form

For distribution models, how is $p(x)$ represented?

  • Exact density: $\log p_\theta(x)$ is tractable. e.g., flows, autoregressive models, many GMMs, some Bayesian networks.

  • Approximate density: Optimized via a bound (e.g., the ELBO). e.g., VAEs, DDPM-style diffusion models.

  • Unnormalized density: Only an energy $E(x)$, with $p(x) \propto e^{-E(x)}$ and an unknown partition function. e.g., EBMs, RBMs.

  • Implicit density: Defined only via a sampler or transformation; no closed-form $p(x)$. e.g., GANs, some Markov chains, pure score models (where only $\nabla_x \log p(x)$ is known).
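
Two of these forms are easy to write down in a few lines. The sketch below (toy 1-D examples; NumPy assumed, function names and parameters are mine) shows an exact density obtained from an affine flow via the change-of-variables formula, and an unnormalized density specified only up to its partition function.

```python
import numpy as np

# Exact density: an affine "flow" x = a*u + b with base u ~ N(0, 1).
# Change of variables gives log p(x) = log N(u; 0, 1) - log|a| in closed form.
def flow_log_density(x, a=2.0, b=1.0):
    u = (x - b) / a
    log_base = -0.5 * (u**2 + np.log(2 * np.pi))
    return log_base - np.log(abs(a))

# Unnormalized density: only an energy E(x) is specified, p(x) ∝ exp(-E(x));
# the partition function Z = ∫ exp(-E(x)) dx is never computed.
def energy(x):
    return 0.5 * x**2 + 0.1 * x**4
```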

4. Fourth axis: Sampling route

For distribution models, how do we sample from the learned distribution?

  • Direct / ancestral: One pass through a generative story or invertible map. e.g., GMM, naive Bayes, ICA, sparse coding, Bayesian networks, flows.

  • Iterative / Markovian: Sampling requires a chain (MCMC, Gibbs, Langevin, reverse diffusion, autoregressive token-by-token). e.g., RBM, EBMs, score-based SDEs, DDPM, Markov chains, autoregressive models.

  • No sampling defined: The model’s goal is to capture structure in $p(x)$ without providing a generative sampler.
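
A minimal sketch of the first two routes (NumPy assumed; parameters are toy values of my choosing): ancestral sampling from a two-component mixture takes one pass, while Langevin dynamics needs many chain steps to draw from $p(x) \propto e^{-E(x)}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Direct / ancestral: one pass through the generative story (two-component mixture).
def sample_direct(n=1000):
    z = rng.choice(2, size=n)                       # component per sample
    return rng.normal(loc=np.where(z == 0, -2.0, 2.0), scale=1.0)

# Iterative / Markovian: Langevin dynamics targeting p(x) ∝ exp(-E(x)),
# here with E(x) = x^2 / 2, so grad E(x) = x; each sample needs many steps.
def sample_langevin(n_steps=500, step=0.01):
    x = rng.normal()
    for _ in range(n_steps):
        grad_e = x                                  # gradient of E(x) = x^2 / 2
        x = x - step * grad_e + np.sqrt(2 * step) * rng.normal()
    return x
```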

5. A unified table of probabilistic models

Putting the axes together, each model can be placed along five dimensions:

  • Axis 1: Target (Distribution vs. Conditional)
  • Axis 2: Latent? (Yes / No)
  • Axis 3: Density type
  • Axis 4: Sampling
  • Axis 5: Inference / training

Table: Categorizing Probabilistic Models

| Model | Target (Dist./Cond.) | Latent? | Density type | Sampling | Inference / Training |
|---|---|---|---|---|---|
| GMM | Dist. | Yes | Exact | Direct (ancestral) | EM / MLE |
| Naive Bayes | Cond. (or Dist.) | Yes (class as latent or observed) | Exact | Direct | MLE |
| Bayesian network | Dist. or Cond. | Yes (if hidden nodes) | Exact (if structured) | Direct | MLE / EM / VI |
| Markov chain (stationary dist.) | Dist. | No | Implicit | Iterative (chain) | Transition estimation / MLE |
| HMM / SSM | Dist. | Yes | Exact (marginalizable) | Direct (ancestral) | EM / VI |
| RBM | Dist. | Yes | Unnormalized joint | Iterative (Gibbs / CD) | Contrastive divergence / ML |
| VAE | Dist. | Yes | Approx. (ELBO) | Direct (ancestral from prior) | Variational inference |
| Diffusion models (DDPM-style) | Dist. | Yes (multi-step noise latents) | Approx. (ELBO-like) | Iterative (reverse diffusion) | Variational / score-style |
| Score-based SDE models | Dist. | No | Implicit (score only) | Iterative (reverse SDE/ODE) | Score matching |
| Normalizing flows | Dist. | No | Exact | Direct (invertible map) | MLE |
| Autoregressive models | Dist. | No | Exact | Iterative (sequential) | MLE |
| EBM (visible only) | Dist. | No (or latents if extended) | Unnormalized | Iterative (MCMC) or none | Contrastive / ML |
| GAN | Dist. | Yes (noise input) | Implicit | Direct (generator forward) | Adversarial (min–max) |
| ICA | Dist. | Yes | Exact (often) | Direct (sample sources, mix) | MLE / InfoMax |
| Sparse coding | Dist. | Yes | Approx. (often MAP-based) | Direct (sample sparse codes) | Recon + sparsity, sometimes VI |
| PPCA / Factor analysis | Dist. | Yes | Exact | Direct | EM |
| Even code IPU model | Dist. | No | Implicit (equal-mass partition of $p(x)$) | None | Information-theoretic |
| Logistic regression | Cond. | No | Exact $p(y\mid x)$ | N/A | CE / MLE |
| Bayesian logistic regression | Cond. | Yes (parameter posterior) | Posterior over params | N/A | VI / MCMC |
| CRF / HCRF | Cond. | Yes (structured latents) | Exact / structured cond. likelihood | N/A | Conditional likelihood / max-margin |
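
For completeness, the five axes also fit naturally into a small data structure. The sketch below (Python; the class and field names are mine, labels copied from the table above) shows how a few models would be tagged.

```python
from dataclasses import dataclass

@dataclass
class ModelCard:
    name: str
    target: str     # Axis 1: "Dist." or "Cond."
    latent: str     # Axis 2: "Yes" / "No" (with caveats)
    density: str    # Axis 3: "Exact", "Approx.", "Unnormalized", "Implicit"
    sampling: str   # Axis 4: "Direct", "Iterative", "None", "N/A"
    training: str   # Axis 5: inference / training procedure

taxonomy = [
    ModelCard("Normalizing flow", "Dist.", "No", "Exact", "Direct", "MLE"),
    ModelCard("VAE", "Dist.", "Yes", "Approx. (ELBO)", "Direct", "Variational inference"),
    ModelCard("EBM", "Dist.", "No", "Unnormalized", "Iterative (MCMC)", "Contrastive / ML"),
    ModelCard("Logistic regression", "Cond.", "No", "Exact p(y|x)", "N/A", "CE / MLE"),
]
```

Reading a model off by its coordinates on these axes says far more than the single generative-vs.-discriminative label.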