Expectation maximization

Expectation maximization (EM) is an iterative method for finding maximum likelihood estimates of the parameters of a model. It alternates between an expectation (E) step, which is computed from the current estimate of the parameters, and a maximization (M) step, which computes new parameter estimates.

Published on Mon, Mar 16, 2020
Last modified on Sun, Sep 7, 2025

K-means clustering

Suppose we have a dataset $\{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ consisting of $N$ observations of a random $D$-dimensional variable. The goal is to partition the dataset into some number $K$ of clusters. Formally, let $\{\pmb{\mu}_1, \ldots, \pmb{\mu}_K\}$ be a set of $D$-dimensional vectors in which $\pmb{\mu}_k$ is associated with the $k^{th}$ cluster (the $\pmb{\mu}_k$ can be thought of as the centers of the clusters). The goal is to find an assignment of data points to clusters, together with a set of vectors $\pmb{\mu}_k$, such that the sum of the squared distances from each data point to its closest vector $\pmb{\mu}_k$ is a minimum.

Let $r_{nk} \in \{0, 1\}$, where $k = 1, \dots, K$, describe the assignment of data point $\mathbf{x}_n$ to a cluster ($r_{nk} = 1$ if $\mathbf{x}_n$ is assigned to cluster $k$ and $r_{nk} = 0$ otherwise). We define a function called the distortion measure, given by:

J = \sum_{n = 1}^{N} \sum_{k = 1}^{K} r_{nk} \lVert \mathbf{x}_{n} - \pmb{\mu}_{k} \rVert^{2} \tag{1}

This represents the sum of the squares of the distances of each data point to its assigned vector $\pmb{\mu}_k$. Our goal is to find values for the $r_{nk}$ and the $\pmb{\mu}_k$ that minimize $J$. The algorithm is as follows:

Algorithm:

Pick initial values for the $\pmb{\mu}_k$.
Minimize $J$ with respect to the $r_{nk}$, keeping the $\pmb{\mu}_k$ fixed (Expectation).
Minimize $J$ with respect to the $\pmb{\mu}_k$, keeping the $r_{nk}$ fixed (Maximization).
Repeat the previous two steps until the assignments no longer change.
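
The steps above can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: the names `kmeans` and `squared_distances`, the choice to initialize the centers from randomly chosen data points, the iteration cap, and the rule of leaving the center of an empty cluster unchanged are all choices made here and are not part of the text above.

```python
import numpy as np

def squared_distances(X, mu):
    """Return an (N, K) array of squared distances ||x_n - mu_k||^2."""
    return ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)

def kmeans(X, K, n_iters=100, seed=0):
    """Alternate the two minimization steps until the centers stop moving."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialization (assumption): use K distinct data points as the centers.
    mu = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(n_iters):
        # Minimize J over r_nk with mu fixed: assign each point to its nearest center.
        labels = squared_distances(X, mu).argmin(axis=1)
        # Minimize J over mu_k with r_nk fixed: move each center to the mean of its points.
        new_mu = mu.copy()
        for k in range(K):
            members = X[labels == k]
            if len(members) > 0:  # assumption: an empty cluster keeps its old center
                new_mu[k] = members.mean(axis=0)
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    # Final assignments and distortion J for the returned centers.
    d2 = squared_distances(X, mu)
    labels = d2.argmin(axis=1)
    J = d2[np.arange(len(X)), labels].sum()
    return mu, labels, J
```

Calling `kmeans(X, K=2)` on an `(N, D)` array returns the centers, the cluster index of each point (the index $k$ for which $r_{nk} = 1$), and the final value of $J$. Neither step can increase $J$, so the loop terminates, but only at a local minimum that depends on the initial centers.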

Multivariate Gaussian distribution

For a random variable $X$ with a finite number of outcomes $x_{1}, x_{2}, \dots, x_{n}$ occurring with probabilities $p_{1}, p_{2}, \dots, p_{n}$ , the expectation of $X$ is defined as:

E[X] = \sum_{i = 1}^{n} x_{i} p_{i}

The covariance between two variables, $X$ and $Y$ , is defined as the expected value (or mean) of the product of their deviations from their individual expected values:

\operatorname{cov}(X, Y) = E[(X - E[X]) (Y - E[Y])]

When working with multiple variables $X_{1}, X_{2}, \dots, X_{n}$, the covariance matrix, denoted as $Σ$, is the $n \times n$ matrix whose $(i, j)$-th entry is $\operatorname{cov}(X_{i}, X_{j})$.
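
As a small illustration of these definitions, the sketch below estimates a covariance matrix from data. It assumes that each of the $m$ observed rows is an equally likely outcome with probability $1/m$, so each expectation becomes a plain average. The function name `covariance_matrix` and the row-per-observation layout are choices made for this example.

```python
import numpy as np

def covariance_matrix(samples):
    """Estimate the covariance matrix from an (m, n) array of m observations of n variables."""
    samples = np.asarray(samples, dtype=float)
    mean = samples.mean(axis=0)      # estimate of E[X_i] for each variable
    centered = samples - mean        # deviations X_i - E[X_i]
    # Entry (i, j) is the average product of the deviations of variables i and j.
    # Assumption: divide by m (each row has probability 1/m), not by m - 1.
    return centered.T @ centered / len(samples)
```

The result is symmetric, with the variances of the individual variables on its diagonal. It should agree with `np.cov(samples, rowvar=False, bias=True)`, which uses the same division by $m$.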

The density function of a univariate Gaussian distribution is given by:

p (x; μ, σ) = \frac{1}{\sqrt{2 π σ^{2}}} \exp (- \frac{1}{2 σ^{2}} (x - μ)^{2})

$(x - μ)^{2}$ is always nonnegative.
The value $k (x, μ) = - \frac{1}{2 σ^{2}} (x - μ)^{2}$ is therefore never positive. As a function of $x$, it is a downward-opening parabola with its peak at $x = μ$.
The $\exp (k (x, μ))$ part makes sure that the quantity is always positive.
The normalization factor $\frac{1}{\sqrt{2 π σ^{2}}}$ multiplies $\exp (k (x, μ))$ so that the density integrates to 1:

\underbrace{\frac{1}{\sqrt{2 π σ^{2}}}}_{\text{normalization factor}} \int_{- \infty}^{\infty} \exp (- \frac{1}{2 σ^{2}} (x - μ)^{2}) \, dx = 1
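
The normalization can be checked numerically. The sketch below evaluates the density on a fine grid and approximates the integral with a Riemann sum over $μ \pm 10σ$, outside of which the density is negligible. The function names, the grid width, and the number of grid points are assumptions made for this example.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density p(x; mu, sigma)."""
    norm = 1.0 / np.sqrt(2.0 * np.pi * sigma**2)
    return norm * np.exp(-((x - mu) ** 2) / (2.0 * sigma**2))

def integrate_pdf(mu, sigma, half_width=10.0, num=200_001):
    """Riemann-sum approximation of the integral of the density over mu +/- half_width * sigma."""
    x = np.linspace(mu - half_width * sigma, mu + half_width * sigma, num)
    dx = x[1] - x[0]
    return np.sum(gaussian_pdf(x, mu, sigma)) * dx
```

For any choice of `mu` and positive `sigma`, `integrate_pdf(mu, sigma)` should come out very close to 1. Without the normalization factor the same sum would instead approximate $\sqrt{2 π σ^{2}}$.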

A vector random variable $X = [X_{1}, \dots, X_{n}]^{T}$ is said to have a multivariate Gaussian distribution with mean $μ \in \mathbb{R}^{n}$ and symmetric positive definite covariance matrix $Σ$ if its probability density function is given by:

p (x; μ, Σ) = \frac{1}{(2 π)^{n / 2} | Σ |^{1 / 2}} \exp (- \frac{1}{2} (x - μ)^{T} Σ^{- 1} (x - μ))
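
A direct translation of this density into NumPy might look like the following sketch. The function name `mvn_pdf` is an assumption for this example, and the code solves a linear system instead of forming $Σ^{-1}$ explicitly, which is a numerical choice and not something the formula requires.

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Multivariate Gaussian density p(x; mu, Sigma) at a single point x."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    n = mu.shape[0]
    diff = x - mu
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu), computed via a linear solve.
    quad = diff @ np.linalg.solve(Sigma, diff)
    # Normalization factor 1 / ((2 pi)^{n/2} |Sigma|^{1/2}).
    norm = 1.0 / ((2.0 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm * np.exp(-0.5 * quad)
```

With $n = 1$ and $Σ = [σ^{2}]$ this reduces to the univariate density, which makes for a convenient sanity check.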

As in the univariate case, the argument of the exponential function is, as a function of $x$, a downward-opening quadratic bowl. The coefficient in front is a normalization factor used to ensure that: