Information is a key driver for any type of organisation. However, with the rapid growth of data volumes, valuable information can remain hidden and unnoticed due to the lack of effective data processing and analysis mechanisms.

Clustering is a machine learning mechanism that organises data by finding data points that are similar to each other.
As it is an unsupervised mechanism, the data are not labeled.

Clustering has many applications in different domains such as biology, business and healthcare.

Here is a straightforward example.



Company C has 100 employees. All the employees can be divided into several groups according to their role in the company: HR, marketing, sales, IT.

You can further categorise all the tech people: front end, back end, full stack, Android dev, iOS dev, CTO, intern…


Clustering is also commonly used in image processing.

In breast cancer detection, a given mammogram is clustered into several parts for further analysis, as shown in figure 2. The regions of interest showing signs of breast cancer in the mammogram can be identified using the K-means algorithm, which I am going to explain in this tutorial.

Image features such as pixel values, color, intensity and texture are used during clustering.


[Figure 2: regions of interest identified in a mammogram]



Hard Clustering vs. Soft Clustering

Clustering techniques can be divided into hard clustering and soft clustering based on cluster membership.

In hard clustering, a given data point belongs to only one cluster. This is also known as exclusive clustering. The K-means clustering mechanism is an example of hard clustering.

In soft clustering, a given data point can belong to more than one cluster. This is also known as overlapping clustering. The Fuzzy K-means algorithm is a good example of soft clustering.
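To make the soft-membership idea concrete, here is a small sketch of the standard Fuzzy K-means membership formula (a hand-rolled illustration, not Mahout's implementation; the distances and the fuzziness parameter m are assumed values):

public class FuzzyMembership {
    // Membership of one point in cluster i: u_i = 1 / sum_k (d_i / d_k)^(2/(m-1)),
    // where d_k is the point's distance to centroid k (assumed > 0) and m > 1
    // controls fuzziness. The memberships always sum to 1.
    static double[] memberships(double[] distances, double m) {
        double[] u = new double[distances.length];
        for (int i = 0; i < distances.length; i++) {
            double sum = 0;
            for (int k = 0; k < distances.length; k++) {
                sum += Math.pow(distances[i] / distances[k], 2.0 / (m - 1));
            }
            u[i] = 1.0 / sum;
        }
        return u;
    }

    public static void main(String[] args) {
        // A point at distance 1 from one centroid and 3 from another, m = 2:
        // it belongs mostly (about 0.9) to the nearer cluster, but not exclusively.
        double[] u = memberships(new double[]{1.0, 3.0}, 2.0);
        System.out.println(java.util.Arrays.toString(u)); // approximately [0.9, 0.1]
    }
}

Note how, unlike hard clustering, no membership drops to zero: every cluster gets a share.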



[Figure: hard clustering vs. soft clustering]



In this tutorial, I am going to present only K-means clustering.


K-means clustering

K-means clustering is a simple and fast clustering algorithm that has been widely adopted in many problem domains. K-means assigns each data point to a cluster so as to minimise the distance from the data points to their cluster centroids.

In hard clustering, one data point belongs only to one cluster. However, there can be situations where one point belongs to more than one cluster. For example, a news article may belong to both “Technology” and “Finance” categories.



K-means clustering is done in four steps:

  1. Determine initial centroids randomly, one per cluster (called C1, C2 and C3 here).
  2. Assign each point to its closest centroid.
  3. Determine new centroids by averaging the data points in each cluster.
  4. Repeat until convergence or until a specified number of iterations is reached.


The K in “K-means” is the number of clusters used.
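The steps above can be sketched as a minimal, standalone Java version (independent of Mahout; the sample points and the hand-picked initial centroids are purely illustrative):

import java.util.Arrays;

public class KMeansSketch {
    // Steps 2-4: assign each point to its nearest centroid, recompute centroids,
    // and repeat until assignments stop changing or maxIterations is reached.
    static int[] cluster(double[][] points, double[][] centroids, int maxIterations) {
        int[] assignment = new int[points.length];
        for (int iter = 0; iter < maxIterations; iter++) {
            boolean changed = false;
            // Step 2: assign each point to its closest centroid
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                for (int c = 1; c < centroids.length; c++) {
                    if (squaredDistance(points[i], centroids[c])
                            < squaredDistance(points[i], centroids[best])) {
                        best = c;
                    }
                }
                if (assignment[i] != best) { assignment[i] = best; changed = true; }
            }
            if (!changed) break; // Step 4: converged
            // Step 3: recompute each centroid as the mean of its assigned points
            double[][] sums = new double[centroids.length][points[0].length];
            int[] counts = new int[centroids.length];
            for (int i = 0; i < points.length; i++) {
                counts[assignment[i]]++;
                for (int d = 0; d < points[i].length; d++) sums[assignment[i]][d] += points[i][d];
            }
            for (int c = 0; c < centroids.length; c++) {
                if (counts[c] == 0) continue; // keep the old centroid if a cluster is empty
                for (int d = 0; d < sums[c].length; d++) centroids[c][d] = sums[c][d] / counts[c];
            }
        }
        return assignment;
    }

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int d = 0; d < a.length; d++) sum += (a[d] - b[d]) * (a[d] - b[d]);
        return sum;
    }

    public static void main(String[] args) {
        double[][] points = {{1, 1}, {1.5, 2}, {8, 8}, {8.5, 9}};
        double[][] centroids = {{0, 0}, {10, 10}}; // Step 1: initial centroids (hand-picked here)
        System.out.println(Arrays.toString(cluster(points, centroids, 10))); // [0, 0, 1, 1]
    }
}

In practice the initial centroids are chosen randomly (as Mahout's RandomSeedGenerator does below), which is why K-means can give different results across runs.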

Java implementation



private static final String DIRECTORY_CONTAINING_CONVERTED_INPUT = "Kmeansdata";

public static void main(String[] args) throws Exception {
    // Path to output folder
    Path output = new Path("Kmeansoutput");
    // Hadoop configuration details
    Configuration conf = new Configuration();
    HadoopUtil.delete(conf, output);
    run(conf, new Path("KmeansTest"), output, new EuclideanDistanceMeasure(), 2, 0.5, 10);
}

public static void run(Configuration conf, Path input, Path output, DistanceMeasure measure,
        int k, double convergenceDelta, int maxIterations) throws Exception {
    // Input should be given in sequence file format
    Path directoryContainingConvertedInput = new Path(output, DIRECTORY_CONTAINING_CONVERTED_INPUT);
    InputDriver.runJob(input, directoryContainingConvertedInput,
        "org.apache.mahout.math.RandomAccessSparseVector");
    // Get initial clusters randomly
    Path clusters = new Path(output, "random-seeds");
    clusters = RandomSeedGenerator.buildRandom(conf, directoryContainingConvertedInput, clusters, k, measure);
    // Run K-means with the given K
    KMeansDriver.run(conf, directoryContainingConvertedInput, clusters, output,
        convergenceDelta, maxIterations, true, 0.0, false);
    // Run ClusterDumper to display the result
    Path outGlob = new Path(output, "clusters-*-final");
    Path clusteredPoints = new Path(output, "clusteredPoints");
    ClusterDumper clusterDumper = new ClusterDumper(outGlob, clusteredPoints);
    clusterDumper.printClusters(null);
}
Here, the input data is read from the "KmeansTest" directory and all results are written under the "Kmeansoutput" directory.


The important objects and parameters used in the code:


  • Path – path to a file or directory in the file system
  • DistanceMeasure – determines the distance between two points; can be customised
  • k – number of clusters
  • convergenceDelta – a double value used to determine whether the algorithm has converged
  • maxIterations – maximum number of iterations to run
  • runClustering – if true, the clustering step is executed after the clusters have been determined
  • runSequential – if true, the sequential K-means implementation is used to process the input data


The output will be in the following format:

[Screenshot: sample clustering output]

After running the algorithm, you will see the clustering output generated for each iteration and the final result in the file system as given below:

[Screenshot: clustering output files generated for each iteration]



More about K-means here.

Distance measures

A clustering problem is based on evaluating the distance between data points: the distance measure is an indicator of how similar two data points are. For any clustering algorithm, you need to decide on a distance measure. The choice of distance measure often matters more for accuracy than the number of clusters. The criteria for choosing the right distance measure depend on the application domain and the dataset.


Apache Mahout provides several distance measures:


  • Euclidean distance: the basic one.
  • Squared Euclidean distance: faster than Euclidean distance; it saves computation time by not taking the square root.
  • Cosine distance: suitable for text clustering.
  • Manhattan distance: works better in high dimensions.
  • Tanimoto distance: captures information about both the relative distance between two points and the angle between them.
  • Weighted distance: assigns weights to different dimensions when evaluating similarity.


Selecting the best distance measure is important for optimising the performance of the algorithm.
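For intuition, a few of these measures can be hand-computed in plain Java (a sketch for illustration only; in practice you would use Mahout's DistanceMeasure implementations):

public class DistanceExamples {
    // Euclidean: square root of the sum of squared differences
    static double euclidean(double[] a, double[] b) {
        return Math.sqrt(squaredEuclidean(a, b));
    }

    // Squared Euclidean: same ordering of distances, but skips the square root
    static double squaredEuclidean(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return s;
    }

    // Manhattan: sum of absolute differences along each dimension
    static double manhattan(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += Math.abs(a[i] - b[i]);
        return s;
    }

    // Cosine distance: 1 minus the cosine of the angle between the vectors
    // (assumes neither vector is all zeros)
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        double[] p = {3, 4}, q = {0, 0};
        System.out.println(euclidean(p, q));        // 5.0
        System.out.println(squaredEuclidean(p, q)); // 25.0
        System.out.println(manhattan(p, q));        // 7.0
        System.out.println(cosine(new double[]{1, 0}, new double[]{0, 1})); // 1.0
    }
}

Notice that squared Euclidean preserves the ranking of distances, which is why it is a safe speed-up for the nearest-centroid assignment step.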



Clustering mechanisms can be divided into different types such as hard, soft, flat, hierarchical and model-based clustering, based on different criteria.

The K-means algorithm is a simple and fast algorithm that is widely applied. However, there are situations that the K-means algorithm cannot handle. For such scenarios, Apache Mahout implements other algorithms such as Canopy, Fuzzy K-means, Streaming and Spectral clustering.

There are several techniques to optimise clustering performance and get the expected results, such as selecting the right features, distance measure and algorithm.


