Real use case of k-means clustering in the security domain
By Sourabh Mishra
What is K-means Clustering?
Clustering is the task of dividing a population or set of data points into a number of groups such that data points in the same group are more similar to each other than to those in other groups. The goal of the k-means algorithm is to find groups in the data, with the number of groups represented by the variable k.
Now let’s look at how k-means clustering is being used in the security domain.
Crime analysis is defined as the set of analytical processes that provide relevant information about crime patterns and trend correlations, to assist personnel in planning the deployment of resources for the prevention and suppression of criminal activities.
The main objectives of crime analysis include:
- Extraction of crime patterns by analysis of available crime and criminal data.
- Prediction of crime based on the spatial distribution of existing data, and anticipation of the crime rate using different data mining techniques.
- Detection of crime.
The procedure is given below:
- First, take the crime dataset.
- Filter the dataset according to requirements and create a new dataset containing only the attributes needed for the analysis.
- Open the RapidMiner tool, read the Excel file of the crime dataset, apply the “Replace Missing Values” operator to it, and execute the operation.
- Apply the “Normalize” operator to the resulting dataset and execute the operation.
- Perform k-means clustering on the normalized dataset and execute the operation.
- In the plot view of the result, plot the crime attributes against one another to obtain the required clusters.
- Analysis can then be done on the clusters formed.
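For readers without RapidMiner, the two preprocessing steps above can be approximated in plain Python. This is only a minimal sketch under stated assumptions: the data and the helper names `replace_missing` and `min_max_normalize` are hypothetical, missing values are filled with the column mean, and min-max scaling is assumed as the normalization method (RapidMiner's operators offer other options as well).

```python
from statistics import mean

def replace_missing(rows):
    """Replace None entries in each column with that column's mean."""
    columns = list(zip(*rows))
    col_means = [mean(v for v in col if v is not None) for col in columns]
    return [
        [col_means[j] if v is None else v for j, v in enumerate(row)]
        for row in rows
    ]

def min_max_normalize(rows):
    """Rescale each column to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [
            (v - lo) / (hi - lo) if hi > lo else 0.0  # constant column -> 0.0
            for v, lo, hi in zip(row, lows, highs)
        ]
        for row in rows
    ]

# Hypothetical data: one row per district, columns = (thefts, assaults).
crime_counts = [
    [120, 30],
    [None, 10],
    [80, None],
    [200, 50],
]

filled = replace_missing(crime_counts)
normalized = min_max_normalize(filled)
```

Normalizing before clustering matters because k-means relies on distances: without it, an attribute with a large numeric range would dominate the others.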
K-means algorithm
K-means clustering is a method of cluster analysis that aims to partition n observations into k clusters, in which each observation belongs to the cluster with the nearest mean.
- Initially, the number of clusters must be known; let it be k.
- The first step is to choose a set of k instances as the centres of the clusters.
- Next, the algorithm considers each instance and assigns it to the closest cluster.
- The cluster centroids are recalculated either after a whole cycle of reassignment or after each instance assignment.
- This process is iterated until the assignments no longer change.
The complexity of the k-means algorithm is O(tkn), where n is the number of instances, k is the number of clusters, and t is the number of iterations, which makes it relatively efficient. It often terminates at a local optimum. Its disadvantages are that it is applicable only when a mean is defined and that k, the number of clusters, must be specified in advance. It is unable to handle noisy data and outliers, and it is not suitable for discovering clusters with non-convex shapes.
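The steps above can be sketched in a few lines of plain Python. This is an illustrative sketch rather than a production implementation: the function name `k_means`, the sample points, the random choice of initial centres, and the `max_iters` and `seed` parameters are all assumptions made for the example. Centroids are recalculated after each whole cycle of reassignment.

```python
import math
import random

def k_means(points, k, max_iters=100, seed=0):
    """Cluster `points` into k groups; returns (centres, labels)."""
    rng = random.Random(seed)
    # Choose k instances as the initial cluster centres.
    centres = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iters):
        # Assign each instance to the cluster with the closest centre.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centres[c]))
            for p in points
        ]
        if new_labels == labels:  # no assignment changed: stop iterating
            break
        labels = new_labels
        # Recalculate each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centre if a cluster is empty
                centres[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centres, labels

# Hypothetical normalized data forming two well-separated groups.
points = [
    (0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
    (0.9, 1.0), (1.0, 0.9), (1.0, 1.0),
]
centres, labels = k_means(points, k=2)
```

Each pass computes n × k distances, which is where the O(tkn) cost over t iterations comes from. Because the result depends on the initial centres, a common practice is to run the algorithm several times with different seeds and keep the best clustering.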