A Cleaner Way to Categorize Probabilistic Models

The classic split between generative and discriminative models was introduced for supervised learning: compare Naïve Bayes, which models the joint $p(x, y)$, with logistic regression, which models the conditional $p(y\mid x)$.
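
For concreteness, the two textbook models factorize as follows (standard definitions; $\sigma$ denotes the logistic sigmoid):

$$
p_{\text{NB}}(x, y) = p(y) \prod_{j=1}^{d} p(x_j \mid y),
\qquad
p_{\text{LR}}(y = 1 \mid x) = \sigma(w^\top x + b).
$$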

In contemporary ML, the term “generative” is used loosely to refer to any model that is not discriminative, regardless of whether it actually specifies a probability distribution, a latent generative story, or even a sampling mechanism. The result is terminology that is historically overloaded, pedagogically misleading, and conceptually inaccurate. The field needs a cleaner taxonomy, one that matches what probabilistic models actually do rather than relying on outdated terminology. Here we propose such a framework.

1. Primary axis: Distribution modeling vs. Conditional modeling

The first, clean question is:

What probability does the model aim to capture?

We propose to replace “generative vs. discriminative” with Distribution modeling and Conditional modeling.

Distribution Modeling

The model’s primary target is the data distribution $p(x)$. Examples include:

  • latent-variable generative models (GMM, HMM, ICA, VAE, RBM, diffusion-as-generative-chain)
  • normalizing flows
  • autoregressive models
  • EBMs
  • score models
  • efficient-coding / equal-mass partition models
  • Bayesian networks modeling only $p(x)$
  • Markov chains defining a stationary distribution

Conditional Modeling

The model’s target is a conditional distribution $p(y\mid x)$. Examples:

  • logistic regression
  • CRFs / HCRFs
  • Bayesian logistic regression

This axis is minimal, clean, and matches probability theory:

  • “Distribution” = model $p(x)$
  • “Conditional” = model $p(y\mid x)$

No extra philosophical baggage.
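
As a minimal, purely illustrative sketch of the two targets (using scikit-learn, with a GMM standing in for a distribution model and logistic regression for a conditional model; the specific model choices are not part of the taxonomy):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Distribution modeling: the target is p(x) itself; labels never enter.
dist_model = GaussianMixture(n_components=3, random_state=0).fit(X)
print(dist_model.score_samples(X[:3]))   # log p(x) under the fitted model

# Conditional modeling: the target is p(y | x); p(x) is never modeled.
cond_model = LogisticRegression().fit(X, y)
print(cond_model.predict_proba(X[:3]))   # p(y | x) for the same inputs
```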

2. Second axis: Latent vs. non-latent

Next, we ask:

Does the model explicitly introduce latent variables $z$ as part of its probabilistic story?

  • Latent = Yes: the model introduces hidden variables (sources, codes, states, classes, etc.) with a distribution $p(z)$ and some relationship to $x$.

  • Latent = No: the model operates directly on $x$; it may factorize $p(x)$, define a flow, an energy, a score, a Markov chain over $x$, or an equal-mass partition, all without any explicit latent $z$.

For distribution models:

  • Latent + distribution ⇒ latent-variable generative models in the classical statistical sense: they specify a joint $p(x,z) = p(z)p(x\mid z)$ and a sampling story.

  • Non-latent + distribution ⇒ non-latent distributional models: they define or approximate aspects of $p(x)$ without introducing explicit latents.

For conditional models:

  • Latent or non-latent simply tells you whether hidden variables enter the conditional, e.g. via a term $p(y\mid x, z)$ that is marginalized out (see the sketch below).
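
In formulas, the three cases look like this (a schematic, using an autoregressive factorization and an HCRF-style marginalization as the illustrative non-latent and latent-conditional instances):

$$
\underbrace{p_\theta(x) = \int p_\theta(z)\, p_\theta(x \mid z)\, dz}_{\text{latent distribution model}},
\qquad
\underbrace{p_\theta(x) = \prod_{i} p_\theta(x_i \mid x_{<i})}_{\text{non-latent distribution model}},
\qquad
\underbrace{p_\theta(y \mid x) = \sum_{z} p_\theta(y, z \mid x)}_{\text{latent conditional model}}.
$$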

3. Third axis: Density form

For distribution models, how is $p(x)$ represented?

  • Exact density: $\log p_\theta(x)$ is tractable. e.g., flows, autoregressive models, many GMMs, some Bayesian networks.

  • Approximate density: optimized via a bound (e.g. the ELBO). e.g., VAEs, DDPM-style diffusion models.

  • Unnormalized density: only an energy $E(x)$ is given, with $p(x) \propto e^{-E(x)}$ and an unknown partition function. e.g., EBMs, RBMs.

  • Implicit density: defined only via a sampler or transformation; no closed-form $p(x)$. e.g., GANs, some Markov chains, pure score models (where only $\nabla_x \log p(x)$ is known).
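
Two of these forms have standard closed-form expressions worth keeping in mind (stated here for an invertible flow $f_\theta$ with base density $p_Z$, and for a VAE with encoder $q_\phi(z \mid x)$):

$$
\log p_\theta(x) = \log p_Z\!\big(f_\theta(x)\big) + \log \left|\det \frac{\partial f_\theta(x)}{\partial x}\right|,
\qquad
\log p_\theta(x) \;\ge\; \mathbb{E}_{q_\phi(z \mid x)}\!\big[\log p_\theta(x \mid z)\big] - \mathrm{KL}\!\big(q_\phi(z \mid x)\,\|\,p(z)\big).
$$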

4. Fourth axis: Sampling route

For distribution models, how do we sample from the learned distribution?

  • Direct / ancestral: one pass through a generative story or invertible map. e.g., GMM, naive Bayes, ICA, sparse coding, Bayesian networks, flows.

  • Iterative / Markovian: sampling requires a chain (MCMC, Gibbs, Langevin, reverse diffusion, autoregressive token-by-token). e.g., RBMs, EBMs, score-based SDEs, DDPM, Markov chains, autoregressive models.

  • No sampling defined: the model’s goal is to capture structure in $p(x)$ without providing a generative sampler.
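
To make the first two routes concrete, here is a small NumPy sketch (illustrative only): a toy 1-D GMM sampled ancestrally in one pass, and the same density sampled iteratively with unadjusted Langevin dynamics using its analytic score.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Direct / ancestral: toy 1-D GMM ---
# Generative story: z ~ Categorical(pi), x | z ~ N(mu[z], sigma[z]^2).
pi = np.array([0.3, 0.7])
mu = np.array([-2.0, 3.0])
sigma = np.array([0.5, 1.0])

def sample_gmm(n):
    z = rng.choice(len(pi), size=n, p=pi)       # sample the latent component
    return rng.normal(mu[z], sigma[z])          # one pass: x | z

# --- Iterative / Markovian: unadjusted Langevin dynamics ---
# Needs only the score grad_x log p(x); here it is analytic for the same GMM.
def score(x):
    w = pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / sigma
    post = w / w.sum(axis=1, keepdims=True)     # responsibilities p(z | x)
    return (post * (mu - x[:, None]) / sigma**2).sum(axis=1)

def sample_langevin(n, steps=1000, eps=0.05):
    x = rng.normal(size=n)                      # arbitrary initialization
    for _ in range(steps):                      # chain: x <- x + eps*score + noise
        x = x + eps * score(x) + np.sqrt(2 * eps) * rng.normal(size=n)
    return x

print(sample_gmm(5))        # one-shot ancestral samples
print(sample_langevin(5))   # samples after running the chain
```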

5. A unified table of probabilistic models

  • Axis 1: Target (Distribution vs. Conditional)
  • Axis 2: Latent? (Yes / No)
  • Axis 3: Density type
  • Axis 4: Sampling
  • Axis 5: Inference / training
  • Axis 6: Primary Use / Function

The last two columns correspond to Shu and Yong in another post.

Table: Categorizing Probabilistic Models

| Model | Target (Dist./Cond.) | Latent? | Density type | Sampling | Inference / Training | Primary Use / Function |
|---|---|---|---|---|---|---|
| GMM | Dist. | Yes | Exact | Direct (ancestral) | EM / MLE | Clustering, density modeling, mixture discovery |
| Naive Bayes | Cond. (or Dist.) | Yes (class as latent or observed) | Exact | Direct | MLE | Fast classification, interpretable generative classifier |
| Bayesian network | Dist. or Cond. | Yes (if hidden nodes) | Exact (if structured) | Direct | MLE / EM / VI | Structured reasoning, causal inference, joint modeling |
| Markov chain (stationary dist.) | Dist. | No | Implicit | Iterative (chain) | Transition estimation / MLE | Modeling long-run distributions, stochastic processes |
| HMM / SSM | Dist. | Yes | Exact (marginalizable) | Direct (ancestral) | EM / VI | Sequence modeling, state discovery, temporal inference |
| RBM | Dist. | Yes | Unnormalized joint | Iterative (Gibbs / CD) | Contrastive divergence / ML | Feature learning, generative structure discovery |
| VAE | Dist. | Yes | Approx. (ELBO) | Direct (ancestral from prior) | Variational inference | Representation learning, generative modeling |
| Diffusion models (DDPM-style) | Dist. | Yes (multi-step noise latents) | Approx. (ELBO-like) | Iterative (reverse diffusion) | Variational / score-style | High-quality sample generation, image/audio synthesis |
| Score-based SDE models | Dist. | No | Implicit (score only) | Iterative (reverse SDE/ODE) | Score matching | Generative modeling via learned score function |
| Normalizing flows | Dist. | No | Exact | Direct (invertible map) | MLE | Exact likelihood modeling, invertible generative mapping |
| Autoregressive models | Dist. | No | Exact | Iterative (sequential) | MLE | High-quality sequence generation, density modeling |
| EBM (visible only) | Dist. | No (or latents if extended) | Unnormalized | Iterative (MCMC) or none | Contrastive / ML | Energy-based representation learning, structure discovery |
| GAN | Dist. | Yes (noise input) | Implicit | Direct (generator forward) | Adversarial (min–max) | Fast generation, realistic sample synthesis |
| ICA | Dist. | Yes | Exact (often) | Direct (sample sources, mix) | MLE / InfoMax | Source separation, latent-factor discovery |
| Sparse coding | Dist. | Yes | Approx. (often MAP-based) | Direct (sample sparse codes) | Recon + sparsity, sometimes VI | Sparse representation learning, dictionary learning |
| PPCA / Factor analysis | Dist. | Yes | Exact | Direct | EM | Dimensionality reduction, latent-factor discovery |
| Even code IPU model | Dist. | No | Implicit (equal-mass partition) | None | Information-theoretic | Representation learning |
| Logistic regression | Cond. | No | Exact | N/A | CE / MLE | Predictive classification, discriminative modeling |
| Bayesian logistic regression | Cond. | Yes (global parameter posterior) | Posterior over parameters | N/A | VI / MCMC | Parameter inference, uncertainty quantification |
| CRF / HCRF | Cond. | Yes (structured latents) | Exact / structured cond. likelihood | N/A | Conditional likelihood / max-margin | Structured prediction (e.g., sequences, segmentation) |