In density estimation, we represent the data compactly using a density from a parametric family, such as the Gaussian or beta distribution. When using a Gaussian distribution for density modelling, we seek the mean and (co)variance that represent the data compactly. Point estimation methods like Maximum Likelihood Estimation (MLE) or Maximum A Posteriori (MAP) estimation can be used to find this mean and variance.
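
As a quick illustration, here is a minimal sketch (with made-up data, purely for illustration) of MLE for a single Gaussian: the maximum likelihood estimates are simply the sample mean and the sample covariance.

import numpy as np

# Hypothetical 2-D data, made up purely for illustration
rng = np.random.default_rng(0)
toy_data = rng.normal(loc=[2.0, -1.0], scale=1.5, size=(100, 2))

# MLE for a single Gaussian: the sample mean and the (1/N) sample covariance
mu_mle = toy_data.mean(axis=0)
Sigma_mle = np.cov(toy_data, rowvar=False, bias=True)  # bias=True gives the 1/N (MLE) estimator

print(mu_mle)
print(Sigma_mle)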

Most of the time, a single model falls short in expressiveness; in other words, it is limited in how well it can model the data. So we consider a mixture of such models.

Mixture Models

A mixture model describes a distribution $p(x)$ by a convex combination of K base distributions. That is, we choose K component distributions and combine them with weights, which can be written mathematically as

\begin{align*} p(x) = \sum_{k=1}^{K}\pi_k p_k(x) \end{align*}

where $\pi_k$ are the mixture weights, with the conditions \begin{align*} 0 \leq \pi_k \leq 1, \quad \sum_{k=1}^{K} \pi_k = 1 \end{align*}

If the component models are Gaussian, the result is a Gaussian Mixture Model (GMM), which is the focus of this tutorial. Other named distributions, such as the beta or gamma distribution, can likewise be combined into mixture models. Our Gaussian Mixture Model can now be written as \begin{align*} p(x) = \sum_{k=1}^{K}\pi_k \mathcal{N}(x|\mu_k, \Sigma_k) \end{align*}
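
Before fitting anything, it can help to see the definition in code. Below is a minimal sketch that evaluates a two-component GMM density at a single point with scipy; the weights, means and covariances used here are assumptions chosen purely for illustration.

import numpy as np
from scipy import stats

# Hypothetical two-component GMM in 2-D (all numbers are illustrative)
pis = np.array([0.3, 0.7])                        # mixture weights, sum to 1
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), 2.0 * np.eye(2)]

x = np.array([1.0, 1.0])

# p(x) = sum_k pi_k * N(x | mu_k, Sigma_k)
p_x = sum(pi_k * stats.multivariate_normal.pdf(x, mu_k, Sigma_k)
          for pi_k, mu_k, Sigma_k in zip(pis, mus, Sigmas))
print(p_x)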

From the equation above, we see that our GMM has three sets of parameters, $\theta = \{\pi_k, \mu_k, \Sigma_k\}$. What we would like to do now is find a closed-form solution for $\theta$ using MLE or MAP. Just as a heads-up: we really cannot find a closed-form solution, but we will see why that is and then discuss a way around it to find the parameters.

Given our data $X = \{ x_1, \dots, x_N\}$ and parameters $\theta$, and assuming the data are independent and identically distributed (i.i.d.), the likelihood factorizes as \begin{align*} P(X|\theta) = \prod_{n=1}^{N}p(x_n|\theta) =\prod_{n=1}^{N} \sum_{k=1}^{K}\pi_k \mathcal{N}(x_n|\mu_k, \Sigma_k) \end{align*} Taking the logarithm of the likelihood, as we always do in MLE, the resulting equation becomes \begin{align*} \log P(X|\theta) =\sum_{n=1}^{N} \log \sum_{k=1}^{K}\pi_k \mathcal{N}(x_n|\mu_k, \Sigma_k) \end{align*} Take a look at the last equation and consider how we could proceed analytically. The logarithm cannot "enter" the inner sum. If $K = 1$ the log acts directly on a single Gaussian and a closed-form solution exists, but with multiple components it does not. So how do we obtain the parameters $\theta$? We turn to an iterative scheme for finding good parameters, and the Expectation Maximization (EM) algorithm comes to the rescue.
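
To make this concrete, here is a minimal sketch that evaluates the GMM log-likelihood numerically (reusing the illustrative two-component parameters assumed above, not values fitted to any real data). Because the log cannot be pushed inside the sum, the numerically stable way to compute each term is a log-sum-exp over the components.

import numpy as np
from scipy import stats
from scipy.special import logsumexp

# Illustrative parameters (assumptions, not fitted values)
pis = np.array([0.3, 0.7])
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), 2.0 * np.eye(2)]

# A few made-up 2-D data points
X_toy = np.array([[0.5, 0.2], [2.8, 3.1], [1.0, 1.0]])

# log p(x_n) = logsumexp_k( log pi_k + log N(x_n | mu_k, Sigma_k) )
log_components = np.stack([
    np.log(pi_k) + stats.multivariate_normal.logpdf(X_toy, mu_k, Sigma_k)
    for pi_k, mu_k, Sigma_k in zip(pis, mus, Sigmas)
], axis=1)                      # shape (N, K)

log_likelihood = logsumexp(log_components, axis=1).sum()
print(log_likelihood)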

A necessary condition for a local optimum is that the derivative of the log-likelihood with respect to each parameter is zero. See the reference if you are interested in the detailed derivation; here I will just state the result and proceed. The derivation leads to a quantity called the responsibility, which is closely related to the likelihood: \begin{align*} r_{nk} = \frac{\pi_k \mathcal{N}(x_n|\mu_k, \Sigma_k)}{\sum_{j=1}^{K}\pi_j \mathcal{N}(x_n|\mu_j, \Sigma_j)} \end{align*}

An important check is that the responsibilities must sum to 1 for each data point, i.e. $\sum_{k=1}^{K}r_{nk}= 1$.
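
Here is a minimal vectorized sketch of the responsibility computation (again with the illustrative parameters assumed above); it also verifies that each row sums to one.

import numpy as np
from scipy import stats

# Illustrative parameters and data (assumptions, for demonstration only)
pis = np.array([0.3, 0.7])
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), 2.0 * np.eye(2)]
X_toy = np.array([[0.5, 0.2], [2.8, 3.1], [1.0, 1.0]])

# Unnormalized responsibilities: pi_k * N(x_n | mu_k, Sigma_k), shape (N, K)
unnorm = np.stack([
    pi_k * stats.multivariate_normal.pdf(X_toy, mu_k, Sigma_k)
    for pi_k, mu_k, Sigma_k in zip(pis, mus, Sigmas)
], axis=1)

# Normalize per data point (per row), so each row sums to 1
r = unnorm / unnorm.sum(axis=1, keepdims=True)
print(r)
print(r.sum(axis=1))   # should print ones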

Hence, $r_{nk}$ distributes a probability mass across the K components for each data point. The next step is to update the weights, means and covariance matrices of the components using the responsibilities $r_{nk}$; the updated parameters are: \begin{align*} \boldsymbol\mu_k^\text{new} &= \frac{\sum_{n = 1}^Nr_{nk}\boldsymbol x_n}{\sum_{n =1}^N r_{nk}}\,,\\ \boldsymbol\Sigma_k^\text{new}&= \frac{1}{N_k}\sum_{n=1}^Nr_{nk}(\boldsymbol x_n-\boldsymbol\mu_k)(\boldsymbol x_n-\boldsymbol\mu_k)^\top\,,\\ \pi_k^\text{new} &= \frac{N_k}{N} \end{align*} where $N_k = \sum_{n=1}^N r_{nk}$.

The mathematical details behind these formulas can be found in the reference section. The EM algorithm can be written as follows:

  • Choose initial values for $\mu_k$, $\Sigma_k$, $\pi_k$
  • Evaluate the responsibilities $r_{nk}$ (E-step)
  • Use the updated $r_{nk}$ to update $\{\pi_k, \mu_k, \Sigma_k\}$ (M-step)
  • If the convergence condition is not met, repeat steps 2 and 3

And with that we are done with the theoretical description of the Gaussian Mixture Model.

Let us now see how to code this up. We will treat it as a modelling exercise and give reasons for some of the choices we make.

#collapse-hide
import numpy as np
from scipy import stats
from sklearn.cluster import AffinityPropagation
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse

Data

The data we are using is geolocation data gathered from 30 students across various African countries: a two-dimensional dataset of about 30 geolocations (latitude/longitude) of birthplaces. The task is to use a GMM to model this data.

#collapse-hide
data = [[6.14512, 7.14512], [6.5582, 3.3466], [5.4646, 10.0631], [14.1372, -16.0754], [35.6177, 5.0999], [0.2774244, 32.4783873],
        [5.56508, -0.1601], [15.65611, 32.48492], [14.7074454, -17.474397], [14.7765117, -17.3226387], [4.03106, 9.73532], [24.713139, 46.622186], 
        [15.65792, 32.46231], [52.4378, -0.76], [13.441518, -6.268147], [4.1058, 9.7529], [10.1345, 14.7117], [5.56, -0.2057], [-1.8848, 29.7772],
        [3.9389, 9.7437], [7.4431, 3.9022], [7.8797, 2.7675], [21.54238, 39.19797], [10.633, 38.783], [-1.6166, 30.1166], [0.2408, 32.5523], 
        [15.79255, 32.52596], [6.531475, 3.32405], [4.9535, 9.9382], [-1.9008438, 36.2843654]] 
# convert our list to an ndarray 
X = np.asarray(data)

Modelling

For model selection, we use the idea behind Occam's razor: favour simple models over complicated ones, especially when we have few data points. One specific challenge here is that the data points are few, so it is difficult to get a good density estimate.

My approach to modelling this data is to first use Affinity Propagation, an algorithm that "guesses" a suitable number of clusters for the data.

#collapse-hide
af = AffinityPropagation().fit(X)
cluster_centers_indices = af.cluster_centers_indices_
labels = af.labels_

n_clusters_ = len(cluster_centers_indices)
print(n_clusters_)
5

This algorithm outputs 5 clusters as the optimal number of clusters.

The first thought that comes to mind when modelling a dataset like this is to use K-means clustering.

Let us use the KMeans algorithm from scikit-learn as a model for the data and see what is going on.

kmeans = KMeans(n_clusters=5, random_state=0)

def plot_kmeans(kmeans, X, n_clusters=5, rseed=0, ax=None):
    labels = kmeans.fit_predict(X)

    # plot the input data
    ax = ax or plt.gca()
    ax.axis('equal')
    ax.scatter(X[:, 0], X[:, 1], c=labels, s=40, cmap='viridis', zorder=2)

    # plot the representation of the KMeans model
    centers = kmeans.cluster_centers_
    radii = [cdist(X[labels == i], [center]).max()
             for i, center in enumerate(centers)]
    for c, r in zip(centers, radii):
        ax.add_patch(plt.Circle(c, r, fc='#CCCCCC', lw=3, alpha=0.5, zorder=1))

plot_kmeans(kmeans, X)

Using K-means is not a bad idea, but as seen above it makes a hard clustering of the data, and the shapes of the clusters are often not intuitive. Also, we can see that some data points lie on the boundary of the cluster shapes.

We therefore consider a generalization of K-means: the Gaussian Mixture Model (GMM).

The next step is also to take the variance within each cluster into account, which a GMM does. In a GMM we also have to specify the number of Gaussian components to mix; the code below uses K = 2 components, and we return to choosing the number of components more systematically with BIC later on.

Modelling with GMM

The first step in our modelling is to initialize our parameters: the means $\mu_k$, covariances $\Sigma_k$ and mixture weights (priors) $\pi_k$.

N, D = X.shape[0], X.shape[1]

# Number of mixture components
K = 2
means = np.zeros((K, D))
covs = np.zeros((K, D, D))
priors = np.zeros((K, 1))

initialCovariance = np.cov(X.T)

for i in range(0, K):
    means[i] = np.random.rand(D) #the initial mean for each gaussian is chosen randomly
    covs[i] = initialCovariance #the initial covariance of each cluster is the covariance of the data
    priors[i, 0] = 1.0 / K  #the initial priors are uniformly distributed.

initial_params = {"mu": means, "S":covs, "pi":priors}

The second step in the algorithm is to evaluate the responsibilities $r_{nk}$; this is the E-step.

#collapse-show
def E_step( mu, S, pi):
    """Compute the E step of the EM algorithm.

    Arguments:
      mu -- component means, K x D array
      S -- component covariances, K x D x D array
      pi -- component weights, K x 1 array

    Returns:
      r_new -- updated component responsibilities, N x K array
    """
  
    # Assert that all arguments have the right shape
    assert(mu.shape == (K, D) and S.shape  == (K, D, D) and pi.shape == (K, 1))

    r_new = np.zeros((N, K))
    # E step: compute the unnormalized responsibilities pi_k * N(x_n | mu_k, S_k)
    for k in range(K):
        S[k] = S[k] + 1e-6 * np.eye(D)  # small jitter keeps the covariance positive definite
        r_new[:, k] = pi[k] * stats.multivariate_normal.pdf(X, mu[k], S[k])
    # Normalize per data point so that each row of responsibilities sums to 1
    r_new = r_new / r_new.sum(axis=1, keepdims=True)
    assert(r_new.shape == (N, K))

    return r_new

The third step, the M-step, uses the updated $r_{nk}$ from the E-step to update $\{\pi_k, \mu_k, \Sigma_k\}$ as given below.

#collapse-show
def M_step(mu, r):
    """Compute the M step of the EM algorithm.

    Arguments:
      mu -- previous component means, K x D array
      r -- previous component responsibilities, N x K array

    Returns:
      mu_new -- updated component means, K x D array
      S_new -- updated component covariances, K x D x D array
      pi_new -- updated component weights, K x 1 array
    """
    assert(mu.shape == (K, D) and\
           r.shape  == (N, K))
    mu_new = np.zeros((K,D))
    S_new  = np.zeros((K, D, D))
    pi_new = np.zeros((K, 1))

    # M step: update mixture weights, means and covariances using the responsibilities
    for i in range(0, K):
        pi_new[i, 0] = np.sum(r[:, i]) / N   # update mixture weights (priors)

        for j in range(0, N):  # update means
            mu_new[i] += r[j, i] * X[j, :]

        mu_new[i] /= np.sum(r[:, i])

    for i in range(0, K):
        for j in range(0, N):  # update covariances
            vec = np.reshape(X[j, :] - mu_new[i, :], (D, 1))
            S_new[i] += r[j, i] * np.multiply(vec, vec.T)  # outer product (x_n - mu_k)(x_n - mu_k)^T

        S_new[i] /= np.sum(r[:, i])

    assert(mu_new.shape == (K, D) and S_new.shape  == (K, D, D) and pi_new.shape == (K, 1))
    
    return mu_new, S_new, pi_new

The next and final step is to implement the EM algorithm that helps us to train our Gaussian mixture model.

The training step basically involves running our E and M steps iteratively until a stopping criterion is reached. Our stopping criterion is based on the log-likelihood estimate, using the fact that "every step in the EM algorithm increases the log-likelihood function".

#collapse-hide
def train(initial_params):
    """Fit a Gaussian Mixture Model (GMM) to the data in matrix X.

    Arguments:
      initial_params -- dictionary with fields 'mu', 'S' and 'pi'

    Returns:
      mu -- component means, K x D array
      S -- component covariances, K x D x D array
      pi -- component weights, K x 1 array
      r -- component responsibilities, N x K array
    """
    # Assert that initial_params has all the necessary fields
    assert(all([k in initial_params for k in ['mu', 'S', 'pi']]))
    mu = initial_params["mu"]
    S = initial_params["S"]
    pi = initial_params["pi"]
    # EM loop: alternate E and M steps until the log-likelihood converges
    num_Iter = 30
    log_likelihoods = []
    threshold = 0.001
    i = 0
    while i < num_Iter:
        # Run the E-step
        r = E_step(mu, S, pi)

        # Run the M-step to update the parameters
        mu, S, pi = M_step(mu, r)

        # Evaluate the log-likelihood under the current parameters
        loglikelihood = 0.0
        for l in range(N):
            s = 0.0
            for j in range(K):
                # small jitter keeps the covariance positive definite
                s += pi[j, 0] * stats.multivariate_normal.pdf(X[l], mu[j], S[j] + 1e-6 * np.eye(D))
            loglikelihood += np.log(s)
        log_likelihoods.append(loglikelihood)

        # Check for convergence against the previous iteration's log-likelihood
        if len(log_likelihoods) > 1 and np.abs(loglikelihood - log_likelihoods[-2]) < threshold:
            print("Likelihood has converged !!")
            break
        i += 1

    assert(mu.shape == (K, D) and S.shape  == (K, D, D) and
           pi.shape == (K, 1) and r.shape  == (N, K))
    
    return mu, S, pi, r

Now that we have written all the code we need to find the parameters of our GMM, we can train the model.

#collapse-show
mu_, S_, pi_, r_ = train(initial_params)
Likelihood has converged !!

We can now visualize our model by plotting the fitted Gaussian components over the data.

def plot_ellipse(pos, cov, nstd=2, ax=None, **kwargs):
    def eigsorted(cov):
        vals, vecs = np.linalg.eigh(cov)
        order = vals.argsort()[::-1]
        return vals[order], vecs[:, order]

    if ax is None:
        ax = plt.gca()

    vals, vecs = eigsorted(cov)
    theta = np.degrees(np.arctan2(*vecs[:, 0][::-1]))

    # Width and height are "full" widths, not radius
    width, height = 2 * nstd * np.sqrt(abs(vals))
    ellip = Ellipse(xy=pos, width=width, height=height, angle=theta, **kwargs)

    ax.add_artist(ellip)
    return ellip

def show(X, mu, cov):

    plt.cla()
    K = len(mu) # number of clusters
    colors = ['b', 'k', 'g', 'c', 'm', 'y', 'r']
    plt.plot(X.T[0], X.T[1], 'm*')
    for k in range(K):
        plot_ellipse(mu[k], cov[k], alpha=0.6, color=colors[k % len(colors)])

X.shape, mu_.shape, S_.shape
((30, 2), (2, 2), (2, 2, 2))
fig = plt.figure(figsize = (12, 12))
# fig.add_subplot(121)
show(X, mu_, S_)

Now we observe that a data point can belong to more than one cluster (soft clustering). We can also see that the shapes of the GMM components differ from the K-means clusters. If we increased the number of components to 5, we would have overlapping data points, i.e. a data point could fall within two clusters. Hence, we have been able to provide a better model for our data.
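
To see the soft assignments explicitly, we can inspect the responsibilities returned by the training loop. The short sketch below uses the r_ array computed above to print per-component membership probabilities and the corresponding hard labels.

# Each row of r_ gives the probability that a data point belongs to each component
print(r_[:5])                        # soft assignments for the first five points
hard_labels = np.argmax(r_, axis=1)  # collapsing to a hard assignment, K-means style
print(hard_labels)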

But then model selection comes into play: how do we know the best number of Gaussian components to use? We can use the Bayesian Information Criterion (BIC) to determine the optimal number of components; a sketch of this is shown below.
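
As a hedged sketch of how such a BIC comparison could be done (this uses scikit-learn's GaussianMixture rather than our hand-rolled EM, and the range of candidate K values is an assumption):

import numpy as np
from sklearn.mixture import GaussianMixture

# Fit GMMs with different numbers of components and compare their BIC scores
candidate_K = range(1, 7)          # assumed range of candidate component counts
bics = []
for k in candidate_K:
    gmm = GaussianMixture(n_components=k, covariance_type='full', random_state=0)
    gmm.fit(X)
    bics.append(gmm.bic(X))

for k, bic in zip(candidate_K, bics):
    print(k, bic)

best_K = list(candidate_K)[int(np.argmin(bics))]  # lower BIC is better
print("Best K by BIC:", best_K)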

Comments on GMM:

Here are a few comments about GMMs:

  1. The Bayesian Information Criterion (BIC), the Akaike Information Criterion (AIC) or cross-validation are suitable methods for finding the right number of components K for a Gaussian Mixture Model.

  2. When not enough probability mass is assigned to as-yet-unseen values that could occur in the dataset, or in the extreme case when one of the means sits exactly on a data point, we run into overfitting.

  3. Overfitting can be mitigated by setting a variance floor, so that no component variance drops below that value; this also helps avoid numerical problems (see the sketch after this list).

  4. Observe that GMMs can be used for clustering; we recover K-means clustering from a GMM if we ignore the covariances, i.e. while a GMM makes soft assignments via the responsibilities, K-means makes hard assignments to the centres.
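
As a minimal sketch of the variance-floor idea from point 3 (the floor value is an arbitrary assumption and clip_covariance is a hypothetical helper, not part of the implementation above), one could clip the eigenvalues of each covariance matrix after the M-step:

import numpy as np

def clip_covariance(S_k, var_floor=1e-3):
    """Hypothetical helper: enforce a variance floor on a covariance matrix
    by clipping its eigenvalues from below."""
    vals, vecs = np.linalg.eigh(S_k)
    vals = np.maximum(vals, var_floor)   # no eigenvalue may fall below the floor
    return (vecs * vals) @ vecs.T        # reassemble the covariance matrix

# Example usage after the M-step (illustrative):
# S_new[i] = clip_covariance(S_new[i], var_floor=1e-3)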

In conclusion, we have studied density estimation and in particular Gaussian Mixture Models. There are other density estimation techniques that could also be used, such as kernel density estimation and histograms.

Reference

Deisenroth, M. P., Faisal, A. A., & Ong, C. S. Mathematics for Machine Learning. https://mml-book.github.io/