Implementation Of Clustering Algorithm K Mean K Medoid Computer Science Essay

Data Mining is a fairly recent and modern topic in computing. However, Data Mining applies many older computational techniques from statistics, machine learning and pattern recognition. This paper explores two most popular clustering techniques will be the k-means& k-medoids clustering algorithm. However, k-means algorithm is cluster or to group your objects based on attributes into K variety of group and k-medoids is a related to the K-means algorithm. These algorithms derive from the k partition algorithms and both attempt to minimize squared error. As opposed to the K-means algorithm K-medoids chooses data points as centres. The algorithms have been developed in Java, for integration with Weka Machine Learning Software. The algorithms have been run with two dataset Facial palsy and Stemming. It really is having been proven that the algorithm is normally faster and much more accurate than other clustering algorithms.

Data Mining derives its name from the similarities between searching for valuable business information in a huge database (for example, finding linked products in gigabytes of store scanner data) and mining a mountain for a vein of valuable ore. [1] Both process requires either sifting through an immense amount of material. Or intelligently probing it to find wherever the worthiness resides.

Data Mining

Data mining is also called "knowledge mining". Before it was named "DATA MINING", it was called "data collection", data warehousing" or "data access". Data mining tools predicts the behaviours of the models that are loaded in the info mining tools (like Weka) for analysis, allowing making predicted analysis, of the model. Data mining provides hands-on and practical information.

Data mining is the most effective tool currently available. Data mining can be utilized for modelling in fields such as artificial intelligence, and neural network.

What does it do?

Data mining take the info which exists in unrelated patterns and designs, and uses this data to predict information which is often compared in conditions of statistical and graphical results. Data mining distil / filters the information from the data that is inputted and final model is generated.

Clustering

"What's cluster analysis? "Unlike classification and prediction, which analyse class-labeled data objects, clustering analyses data objects without consulting a known class label.

A 2-D plot of customer data with respect to customer locations in a city, showing three data clusters. Each cluster "center" is marked with a "+". [6]

Clustering is the technique by which like objects are grouped together. The objects are clustered or grouped predicated on the principle of maximizing the intra class similarity and minimizing the interclass similarity. i. e. clusters of the objects are made so the clusters have resemblance compared to one another, but are very divergent to objects in other clusters. Each cluster that is made can be viewed as a class of objects, that rules can be derived. [6]

Problem overview

The problem accessible can correctly cluster a facial palsy dataset which is given by our lecturer. This section will provide a synopsis of dataset being analysed, and description about dataset that people use in this implementation.

Data Set

1. 3. 1. 1 Facial_Palsy_svmlight_format

Facial Palsy data is made for binary classification.

+1 severe facial palsy faces

-1 Non-severe or normal faces

66 Principal components made from 50x50 Hamming distance images

1. 3. 1. 2 A6_df2_stemming__svm:

Attributes: 100

A6_df2_stemming__svm_100. dat

+1 Open question

-1 Closed question

Section 2 - Methodology

This section will firstly discuss the methodology behind K-means & k-medoids algorithm. It really is than followed by steps to implement k-means and k medoids algorithms. How many input, output and what exactly are the steps to perform k-means and k-medoids.

2. 1 K-mean

K-means clustering starts with an individual cluster in the centre, as the mean of the info. Here after the cluster is put into 2 clusters and the mean of the new cluster are iteratively trained. Again these clusters are split and the procedure goes on before specified amounts of the cluster are obtained. If the specified amount of cluster is not a power of two, then your nearest power of two above the number specified is selected and then your least important clusters are removed and the rest of the clusters are again iteratively trained to get the final clusters. If an individual specifies the random start, random cluster is made by the algorithm, and it goes ahead by fitting the data points into these clusters. This technique is repeated often in loops, for as much random numbers the user chooses or specifies and the best value is found by the end. The output values are displayed.

The drawbacks of the clustering method are that, the measurement of the errors or the uncertainty is ignored from the data.

Algorithm: The k-means algorithm for partitioning, where each cluster's centre is represented by the mean value of the objects in the cluster.

Input:

k: the number of clusters,

D: a data set containing n objects.

Output: A set of k clusters.

Method:

(1) Arbitrarily choose k objects from D as the original cluster centers;

(2) Repeat

(3) reassign each object to the cluster to which the object is the most similar, based on the mean value of the objects in the cluster;

(4) Update the cluster means, i. e. , calculate the mean value of the objects for every cluster;

(5) Until no change;

Where E is the sum of the square error for any objects in the data set; p is the point in space representing a given object; and mi is the mean of cluster Ci (both p and mi are multidimensional). Quite simply, for every object in each cluster, the length from the object to its cluster center is squared, and the distances are summed. This criterion tries to help make the resulting k clusters as compact so that as separate as it can be. [2]

Clustering of a couple of objects based on the k-means method. (The mean of every cluster is marked by way of a "+". )

2. 2 K- Medoids

This report recommends a new algorithm for K-medoids, which runs like the K-means algorithm. The algorithm proposed scans and calculates distance matrix, and make use of it for finding new medoids at every constant and repetitive step. The evaluation is based on real and artificial data and is also compared with the results of the other algorithms.

Here we have been discussing the approach on k- medoids clustering, using the k-medoids algorithm. The algorithm is usually to be implemented on the dataset which contain uncertain data. K-medoids are implemented because they to represent the located objects called medoids in a cluster. Here the k-medoids algorithm can be used to get the representative objects called the medoids in the dataset.

Algorithm: k-medoids. PAM, a k-medoids algorithm for partitioning predicated on medoids or central objects.

Input:

k: the number of clusters,

D: a data set containing n objects.

Output: A couple of k clusters.

Method:

(1) Arbitrarily choose k objects in D as the initial representative objects or seeds;

(2) Repeat

(3) Assign each remaining object to the cluster with the nearest representative object;

(4) Randomly decide on a no representative object, o random;

(5) Compute the full total cost, S, of swapping representative object, oj, with o random;

(6) If S < 0 then swap oj with o random to create the new set of k representative objects;

(7) Until no change;

Where E is the sum of the absolute error for all objects in the info set; p is the point in space representing a given object in cluster Cj; and oj is the representative object of Cj. Generally, the algorithm iterates until, eventually, each representative object is really the medoids, or most located object, of its cluster. This is actually the basis of the k-medoids way for grouping n objects into k clusters. [6]

2. 3 Distance Matrix

An important part of most clustering is to choose a distance measure, that may determine how the similarity of two elements is calculated.

Common distance metrics:

Euclidean

Manhattan

Minkowski

Hamming etc

Here in our implementation we choose two distance matrix that you can see below with description.

2. 3. 1 Euclidean Distance Metric

The Euclidean distance between point's p and q is the space of the line segment. In Cartesian coordinates, if p = (p1, p2. . . pn) and q = (q1, q2. . . qn) are two points in Euclidean n-space, then your distance from p to q is distributed by:

2. 3. 2 Manhattan Distance Metric

The Manhattan (or taxicab) distance, d1, between two vectors in a n-dimensional real vector space with fixed Cartesian coordinate system, is the sum of the lengths of the projections of the line segment between points onto the coordinate axes.

Section 3 - Discussion

In this section we have been discussing about how precisely Weka Machine learning work and how exactly we implemented both k-means and k medoids algorithm. To implement both of these algorithms we use Java and we are explaining how we implemented in java which function we use to be able to implement these two algorithms.

3. 1 Weka Machine Learning

Weka is a machine learning software made using Java and a great many other languages. Weka has a assortment of tools that are being used to analyse the data that the user inputs by means of dataset files. Weka supports more than four different input data formats. Weka uses an interactive GUI interface, which is straightforward for an individual to use. Weka supplies the functionality for testing and visual aid options that may be used by an individual to compare and sort the results.

3. 2 Implementation

In this section, we discuss about implementation of 2 clustering algorithms: K-Means and K-Medoids. Here, we use Object Oriented Programming to implement these 2 algorithms. The structure of program as below

There are 3 packages: K-Mean, K-Medoid, main.

Files in K-Mean package

Centroid. java

Cluster. java

KMean_Algorithm. java

KMean_Test. java

KMean_UnitTest. java

Files in K-Medoid package

KMedoid_Algorithm. java

KMedoid_UnitTest. java

Files in main package

Attribute. java

DataPoint. java

DistanceCalculation. java

FileFilter. java

MainFrame. java

Utilities. jav

There are some main functions implemented for clustering activity as below

3. 2. 1 read_SVMLightFile_fill_up_missing_attribute()

This function is approximately reading the SVM Light data file (. dat) and fill all the missing attributes/values in data file before returning a Vector of data-points for clustering activity.

3. 2. 2 calculate_distance()

This function offers calculation based on the distance metric input in order to calculate distance between data objects for clustering activity. Overall, this function provides calculation for 3 different distance metrics as: Euclidean, Manhattan and Minkowski.

3. 2. 3 startClustering()

This function is about running a particular clustering algorithm and returns a Vector of Clusters with the own data-points inside. All of the steps of a particular clustering algorithm is implemented, here we implement K_Means and K_Medoids clustering algorithms.

3. 2. 4 calculateSumOfSquareError()

This function is about calculating the total/sum square error for all your output clusters. By calling the function "calculateSquareError()" inside every cluster and sum up, the sum of Square Error will be calculated as long as the clustering activity finished.

3. 2. 5 calculateSumOfAbsoluteError()

This function is approximately calculating the total/sum absolute error for all your output clusters. By calling the function "calculateAbsoluteError()" inside every cluster and summarize, the sum of Absolute Error will be calculated so long as the clustering activity finished.

3. 2. 6 toString() and main()

The toString() function will return a string which represents the clustering output, including: total objects of every cluster, percent of object in every cluster, the error (such as: sum of square error or sum of absolute error), the centroid of every cluster and all the data-points clustered in the clusters.

The main() function inside MainFrame. java class will allow to execute the GUI of this program, so users can interact with system by GUI instead of console or command-line. In this particular GUI, users can pick type of distance metric (such as Euclidean and Manhattan), Clustering algorithm (such as K-Means and K-Medoids) and enter input parameters such as variety of clusters and quantity of iterations for clustering activity. Besides, users also can open any data file to see or modify and save before running clustering as well as export the initial data file with missing attributes/values to new processed data file with all missing values chock-full by zero (0).

Section 4 - Analysis

In order to access the performance of the K-means & k-medoids clusters, two dataset of analyses was carried out. The aim of this set to tests was provide an indicator as to how well the clusters performed using the k-means and k-medoids function. The tests were involved comparing the cluster to other cluster of varied types provided within Weka cluster suite. The results are summarised throughout the rest of the section.

4. 1 Experiment (Facial Palsy dataset) results vs. Weka

Here In this particular section how we did a comparison with this application algorithm vs. Weka you can see below.

In this pattern we give iterations when we operate a dataset with our application and Weka.

Iterations: 10 >> 30 >> 50 >> 100 >> 200 >> 300 >> 400 >> 500

In this pattern we provide a cluster whenever we operate a dataset with our application and Weka.

Clusters: 2 >> 3 >> 4 >> 5

After we run dataset with this format than each and every run we get result we incorporate that result, equate to Weka, we make a complete of each and every column and come with average and we are displaying in table that you can view in below table.

This Symbol is object. To visit a result please select this object it'll demonstrate result. We put as object because result is too large in proportions so we cannot devote this A4 page.

4. 2 Experiment (Stemming Question dataset) results vs. Weka

Here Within this section how exactly we did a comparison with our application algorithm vs. Weka you can view below.

In this pattern we give iterations when we run a dataset with this application and Weka.

Iterations: 10 >> 30 >> 50 >> 100 >> 200 >> 300 >> 400 >> 500

In this pattern we give a cluster when we operate a dataset with this application and Weka.

Clusters: 2 >> 3 >> 4 >> 5

After we run dataset with this format than each and every run we get result we combine that result, compare with Weka, we make a total of the column and include average and we are displaying in table that you can view in below table.

This Symbol is object. To see a result please click on this object it will show you result. We put as object because result is too big in size so we are not able to put in this A4 page.

Section 5 - Conclusion

In evaluating the performance of data mining techniques, in addition to predicative accuracy, some researchers have been done the importance of the explanatory nature of models and the need to reveal patterns that are valid, novel, useful and may be most importantly understandable and explainable. The K-means and k-medoids clusters achieved this by successfully clustering with facial palsy dataset.

"Which method is more robust-k-means or k-medoids?" The k-medoids method is more robust than k-means in the presence of noise and outliers, just because a medoids is less influenced by outliers or other extreme values than a mean. However, its processing is more expensive than the k-means method. Both methods require the user to specify k, the amount of clusters.

Aside from using the mean or the medoids as a measure of cluster center, other alternative measures are also commonly found in partitioning clustering methods. The median can be used, leading to the k-median method, where in fact the median or "middle value" is taken for every single ordered attribute. Alternatively, in the k-modes method, the most frequent value for every attribute is used.

5. 1 Future Work

The K-means algorithm can create some in efficiency as; it scans the dataset leaving some noise and outliners. These small flaws can be considered major for some of the users, but this won't means that the implementation can be prevented. It will always be possible that sometimes the dataset is more efficient to check out other algorithms better, and the effect distribution can be equal or acceptable. It is always advisable to make the dataset better by removing unwanted attributes and more meaning full by pre-processing the nominal values to the numeric values.

5. 2 Summery

Throughout this report the k-mean and the k-medoids algorithms are implemented, which find the best result by scanning the dataset and creating clusters. The algorithm was developed using Java API and much more Java classes.

Also We Can Offer!

Ошибка в функции вывода объектов.