Over the years, several types of clustering algorithms have been developed. They are generally divided into four categories: partitioning algorithms, hierarchical algorithms, density-based algorithms, and model-based algorithms. Partitioning algorithms are the most commonly used because they are simple and intuitive: they divide the data into several partitions and evaluate them by some criterion. Model-based algorithms, in comparison, hypothesize a statistical model for each cluster in the dataset under the condition that the hypothesized model is the best fit for that cluster. As part of a research project to classify LiDAR data, I examined the similarities and differences between partitioning and model-based clustering algorithms for tree species classification, using K-means and Expectation Maximization as sample algorithms from the two categories above. This blog post summarizes my findings with some cool visualizations, but if you are interested in the more technical details you can read my full paper here.
K-means
K-means clustering is probably one of the first unsupervised learning algorithms that most people encounter when they begin a machine learning course. It is easy to use and intuitive. However, if we delve into the probabilistic theory behind K-means, it becomes apparent that the algorithm makes very general assumptions about the distribution of the data. The K-means algorithm attempts to detect clusters within the dataset under the optimization criterion that the sum of the intra-cluster variances is minimized. Hence the K-means clustering algorithm produces a Minimum Variance Estimate (MVE) of the mean of each identified cluster in the data.

The intuition behind the algorithm lies in the fact that, on average, the Euclidean distance from the cluster centroid (μ) to the elements within the cluster should be homogeneous among all identified clusters (this is a shortcoming that we will see later on). Although the method works well in detecting homogeneous clusters, it inevitably falls short due to the simplistic assumption about the spherical nature of the clusters inherent in its optimization function (i.e. it assumes that all the clusters have equal covariance matrices).
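To make the assignment/update mechanics concrete, here is a minimal NumPy sketch of K-means (this is not the implementation used in the paper; the function name and initialization scheme are illustrative):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-means: alternate nearest-centroid assignment and mean updates."""
    rng = np.random.default_rng(seed)
    # Initialize centroids with k distinct data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        # Assignment step: each point joins its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its members.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # guard against empty clusters
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

Each iteration can only decrease the total within-cluster variance, which is why the algorithm converges, albeit possibly to a local optimum.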
Expectation Maximization Estimation
Expectation Maximization (EM) is another popular, though a bit more complicated, clustering algorithm that relies on maximizing the likelihood to find the statistical parameters of the underlying sub-populations (i.e. clusters) in the dataset. I will not get into the probabilistic theory behind EM; if you are interested, you can read the full paper here. But to briefly summarize, the EM algorithm alternates between two steps (the E-step and the M-step). In the E-step the algorithm finds a lower-bound function on the original likelihood using the current estimate of the statistical parameters. In the M-step the algorithm finds new estimates of those statistical parameters by maximizing the lower-bound function (i.e. determining the MLE of the statistical parameters). Since each iteration maximizes the lower-bound function, the algorithm always produces estimates with a higher likelihood than the previous iteration and ultimately converges to a (local) maximum.
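As a toy illustration of the two alternating steps, here is a sketch of EM for a two-component 1-D Gaussian mixture (purely illustrative; the min/max initialization is an assumption, not the scheme from the paper):

```python
import numpy as np

def em_gmm_1d(x, n_iters=50):
    """Toy EM for a two-component 1-D Gaussian mixture."""
    # Crude initialization: put the components at the extremes of the data.
    mu = np.array([x.min(), x.max()], dtype=float)
    sigma = np.array([x.std(), x.std()])
    pi = np.array([0.5, 0.5])  # mixing weights
    for _ in range(n_iters):
        # E-step: responsibility of each component for each point.
        dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) \
               / (sigma * np.sqrt(2 * np.pi))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted maximum-likelihood updates of the parameters.
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
        pi = nk / len(x)
    return mu, sigma, pi
```

Note that, unlike K-means' hard assignments, the responsibilities are soft: every point contributes to every component's parameters, weighted by probability.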
So what makes EM so special compared to K-means? Unlike K-means, EM does not limit the clusters to spherical shapes. In EM we can constrain the algorithm to produce different covariance matrices (spherical, diagonal, and generic). These different covariance matrices in turn allow us to control the shape of our clusters, and hence we can detect sub-populations with different characteristics in our data.
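These constraints correspond, for example, to the `covariance_type` options of scikit-learn's `GaussianMixture`, shown here on synthetic elongated clusters purely as an illustration (this is not the setup from the paper):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two elongated synthetic clusters that a spherical model describes poorly.
X = np.vstack([
    rng.multivariate_normal([0, 0], [[4.0, 0.0], [0.0, 0.2]], 200),
    rng.multivariate_normal([0, 5], [[4.0, 0.0], [0.0, 0.2]], 200),
])

# spherical -> equal-variance circles, diag -> axis-aligned ellipses,
# full -> freely oriented ellipses (the "generic" case above).
for cov_type in ("spherical", "diag", "full"):
    gm = GaussianMixture(n_components=2, covariance_type=cov_type,
                         random_state=0).fit(X)
    print(cov_type, "converged:", gm.converged_)
```

The more flexible the covariance model, the more parameters must be estimated per cluster, a trade-off that becomes important later.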

You may have noticed that the data in the figure above contains outliers, shown as yellow points. Neither algorithm has the ability to detect outliers, so we must pre-process the data to mitigate their effect on the detected clusters. EM in particular tends to be sensitive to the presence of outliers, as clusters can grow beyond the actual data in order to classify the outliers.
The LiDAR Dataset
So now that we have a basic understanding of how EM works, we can talk about some results. I obtained a multispectral LiDAR dataset from Optech's Titan sensor, which has 3 channels (532 nm, 1064 nm and 1550 nm). The dataset was collected over Scarborough, Toronto at a flying height of 1500 meters. In order to classify the tree species, I pre-processed the data to remove all non-tree cover types using thresholding techniques.


- LiDAR scan (Right Image); Titan multispectral LiDAR intensity data plotted in the spectral domain (Top Left: 1550 nm band on the vertical axis vs. 532 nm band on the horizontal axis; Top Right: 1550 nm band vs. 1064 nm band; Bottom: 1064 nm band vs. 532 nm band).
The dataset also contained outliers, which were removed during the pre-processing phase. Since I already had an implementation of K-means, I used the Outlier Removal Clustering (ORC) method to filter the outliers from the dataset.
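I won't reproduce my exact pipeline here, but a filter in the spirit of ORC can be sketched as follows: cluster the data, drop points whose distance to their centroid exceeds a fraction of the current maximum distance, and repeat (the threshold and round count below are illustrative assumptions, not the values used in the paper):

```python
import numpy as np
from sklearn.cluster import KMeans

def orc_filter(X, k, threshold=0.95, n_rounds=5, seed=0):
    """Iteratively drop points far from their centroid (in the spirit of ORC).

    Points whose distance to their assigned centroid exceeds
    threshold * (current maximum distance) are flagged as outliers.
    """
    keep = np.ones(len(X), dtype=bool)
    for _ in range(n_rounds):
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X[keep])
        d = np.linalg.norm(X[keep] - km.cluster_centers_[km.labels_], axis=1)
        inliers = d < threshold * d.max()
        idx = np.flatnonzero(keep)
        keep[idx[~inliers]] = False
    return keep
```

Because the cutoff is relative to the current maximum distance, each round peels off the most extreme points first, so the threshold and number of rounds together control how aggressive the filtering is.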
Clustering
Using the data in the 3 spectral features above, I first determined the optimal number of clusters in the dataset. Partitioning and model-based algorithms share a major drawback: the number of clusters in the dataset must be provided by the user (density-based algorithms do not suffer from this problem, although other thresholds need to be specified instead). To determine the optimal number of clusters, I again made use of my existing implementation of K-means: I ran the algorithm over a range of cluster counts and recorded the Sum of Squared Errors (SSE) for each partition.

I then used the asymptotic nature of the graph to pick out the optimal number of clusters in the dataset (this was based on subjective interpretation; however, I found that the SSE behaved reasonably asymptotically after 3 clusters).
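The SSE-versus-k ("elbow") procedure can be sketched like this, using synthetic three-band data as a stand-in for the real intensity features:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic stand-in for the three intensity bands: three separated clusters.
X = np.vstack([rng.normal(m, 0.4, (100, 3)) for m in (0.0, 3.0, 6.0)])

sse = {}
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse[k] = km.inertia_  # sum of squared distances to the nearest centroid

# The "elbow" is where adding more clusters stops paying off.
for k in range(1, 8):
    print(k, round(sse[k], 1))
```

On data like this, the SSE drops sharply up to the true number of clusters and then flattens out, which is the behaviour I exploited on the LiDAR features.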
I then ran K-means and different variants of the EM algorithm to obtain the following clusters. For the EM algorithm, I primarily looked at three configurations of the covariance matrices: spherical, diagonal, and fully populated (or generic).


- K-means (Left) vs EM with Spherical Covariance Matrix (Right)


- EM with Generic Covariance Matrix (Left) vs EM with Diagonal Covariance Matrix (Right)
I found that the results from K-means were closest to the spherical variant of the EM algorithm, which is what we would expect. The diagonal variant of EM allowed each cluster to have a different size, whereas the generic covariance matrix additionally gave each cluster a different orientation in the feature space.
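The similarity between K-means and spherical-covariance EM can be checked directly on synthetic data (again a stand-in for the real features) by measuring how often the two algorithms assign points to the same cluster:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (100, 2)), rng.normal(4.0, 0.5, (100, 2))])

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
gm_labels = GaussianMixture(n_components=2, covariance_type="spherical",
                            random_state=0).fit_predict(X)

# Cluster labels are arbitrary, so check both label orientations.
agreement = max(np.mean(km_labels == gm_labels), np.mean(km_labels != gm_labels))
print("label agreement:", agreement)
```

On well-separated, roughly spherical clusters like these, the two partitions are essentially identical; the differences only emerge once the clusters have unequal or non-spherical covariances.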
While the generic covariance matrix allowed for greater flexibility in detecting different sub-populations in the data, it was certainly not without problems. During my programming spree I had some trouble with convergence while using the fully populated covariance matrix. I found that the solution at times would easily diverge due to large covariances. This was particularly true when I tested higher-dimensional datasets (i.e. hyperspectral images). The convergence problems of EM with a full covariance model are nothing new; due to the large number of parameters that need to be estimated, convergence to a correct solution can also be painfully slow.
To conclude, the EM algorithm offers a powerful alternative to the popular K-means, with greater control over the characteristics of the clusters. However, the EM algorithm, like K-means, only guarantees a locally optimal solution. In other words, there is no guarantee that the algorithm will produce models (i.e. clusters) that best fit the underlying sub-populations.