
Question:

Download the BBC Sport dataset from the cloud. This dataset consists of 737 documents from the BBC Sport website, corresponding to sports news articles in five topical areas from 2004-2005. There are 5 class labels: athletics, cricket, football, rugby, tennis. The original dataset and raw text files can be downloaded from here.

1. There are 3 files in the dataset corresponding to the feature matrix, the class labels, and the term dictionary. You need to read these files in a Python notebook and store them in the variables X, trueLabels, and terms.
2. Next, perform K-means clustering with 5 clusters using Euclidean distance as the similarity measure. Evaluate the clustering performance using the adjusted Rand index and adjusted mutual information. Report the clustering performance averaged over 50 random initializations of K-means.
3. Repeat K-means clustering with 5 clusters using a similarity measure other than Euclidean distance. Evaluate the clustering performance over 50 random initializations of K-means using the adjusted Rand index and adjusted mutual information. Report the clustering performance and compare it with the results obtained in step 2.
4. For both clustering cases (Euclidean distance and the other similarity measure), visualize the cluster centres as tag clouds using the Python package WordCloud.
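
Step 3 asks for a similarity measure other than Euclidean distance. For text data a common choice is cosine similarity, and one standard trick is to L2-normalize each document vector and then run ordinary (Euclidean) K-means on the normalized rows, since the squared Euclidean distance between unit vectors is a monotone function of cosine similarity. A minimal sketch, assuming scikit-learn is available; the small random matrix below is only a stand-in for the real X and trueLabels:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize
from sklearn.metrics import adjusted_rand_score

# Toy stand-in for the BBC document-term matrix (rows = documents).
rng = np.random.default_rng(0)
X = np.abs(rng.normal(size=(20, 6)))
trueLabels = np.array([0] * 10 + [1] * 10)
X[:10, :3] += 5.0   # first topic uses the first three terms heavily
X[10:, 3:] += 5.0   # second topic uses the last three terms heavily

# L2-normalize the rows: for unit vectors, ||a - b||^2 = 2 - 2*cos(a, b),
# so Euclidean K-means on Xn behaves like cosine-based clustering.
Xn = normalize(X, norm="l2")

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Xn)
print(adjusted_rand_score(trueLabels, labels))
```

For step 4, each cluster centre row can be turned into a term-to-weight dictionary (using terms) and passed to WordCloud's generate_from_frequencies to draw the tag cloud.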
Clustering

Clustering is the task of dividing data points into groups such that points in the same group are more similar to one another than to points in other groups. In other words, it aims to separate data with similar characteristics and assign them to clusters. Broadly, clustering can be divided into two subgroups (Aggarwal & Reddy, 2016):

• Hard Clustering: every data point either belongs to a cluster completely or not at all. For instance, in the above case every customer is put into exactly one group out of 10 groups.
• Soft Clustering: instead of putting each data point into a separate cluster, a probability or likelihood of that data point belonging to each cluster is assigned.
Dataset

This task performs clustering on the provided dataset, the BBC Sport dataset from the cloud. The dataset contains 737 documents from the BBC Sport website corresponding to sports news articles. The provided dataset consists of three files: the BBC Sport classes, the BBC Sport matrix, and the BBC Sport terms. These files are read in below.
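One way to read the three files into X, trueLabels, and terms is sketched below. The real file names and formats depend on the download, so in-memory stand-ins are used here; the illustrative assumption is whitespace-delimited numbers for the matrix and labels, and one term per line.

```python
import io
import numpy as np

# Stand-ins for the three downloaded files; in practice pass the real
# paths instead (file names such as "bbcsport.mtx" are illustrative only).
matrix_file = io.StringIO("0 1 2\n3 4 5\n6 7 8\n")
classes_file = io.StringIO("0\n1\n2\n")
terms_file = io.StringIO("athletics\ncricket\nfootball\n")

X = np.loadtxt(matrix_file)                       # feature matrix: docs x terms
trueLabels = np.loadtxt(classes_file, dtype=int)  # one class label per document
terms = [line.strip() for line in terms_file]     # term dictionary

print(X.shape, trueLabels.shape, len(terms))
```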

K-Means Clustering

K-Means is probably the best-known clustering algorithm. It is taught in many introductory data science and machine learning classes, is easy to understand, and can be implemented in only a few lines of code (Kaushik, 2016).

1. To start, select the number of classes/clusters to use and randomly initialize their respective centre points. To decide how many classes to use, it helps to inspect the data and try to identify any distinct groupings. The centre points are vectors of the same length as each data point vector.
2. Each data point is classified by computing the distance between that point and each cluster centre, and then assigning the point to the cluster whose centre is nearest to it.
3. Based on these assignments, recompute each cluster centre as the mean of all the vectors assigned to that cluster.
4. Repeat these steps for a set number of iterations, or until the cluster centres change little between iterations. You can also run the algorithm from several random initializations of the cluster centres and then select the run that appears to give the best result (Celebi, 2016).
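
The four steps above can be sketched as a minimal NumPy implementation (illustrative, not optimized):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k random data points as the initial centres.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centre (Euclidean).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centre as the mean of its points.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 4: stop once the centres no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Two well-separated blobs: K-means recovers them as two clusters.
X = np.vstack([np.zeros((5, 2)), np.ones((5, 2)) * 10])
labels, centers = kmeans(X, k=2)
print(labels)
```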

K-Means has the advantage of being very fast: all we are really doing is computing the distances between points and cluster centres, so very few calculations are needed. It therefore has linear complexity, O(n).

On the other hand, K-Means has a few disadvantages. First, you have to choose how many clusters/classes there are. This is not always trivial, and ideally a clustering algorithm would figure that out for us, since the whole point of clustering is to gain insight from the data. K-Means also starts from a random choice of cluster centres, and may therefore yield different clustering results on different runs of the algorithm. The results may thus not be repeatable or consistent; other clustering methods are more stable.
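
This sensitivity to initialization is exactly why the assignment asks to average the adjusted Rand index and adjusted mutual information over 50 random initializations. A sketch with scikit-learn, using a synthetic stand-in for the BBC data so the snippet runs on its own:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score
from sklearn.datasets import make_blobs

# Synthetic stand-in for (X, trueLabels); replace with the BBC data.
X, trueLabels = make_blobs(n_samples=200, centers=5, cluster_std=0.5,
                           random_state=0)

ari, ami = [], []
for seed in range(50):
    # n_init=1 so each run really is a single random initialization.
    labels = KMeans(n_clusters=5, init="random", n_init=1,
                    random_state=seed).fit_predict(X)
    ari.append(adjusted_rand_score(trueLabels, labels))
    ami.append(adjusted_mutual_info_score(trueLabels, labels))

print(f"ARI: {np.mean(ari):.3f} +/- {np.std(ari):.3f}")
print(f"AMI: {np.mean(ami):.3f} +/- {np.std(ami):.3f}")
```

The spread across seeds makes the repeatability problem visible: individual runs can land in different local optima even when the average is high.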

K-Medians is another clustering algorithm related to K-Means, except that instead of recomputing the cluster centres using the mean, we use the median vector of the cluster. This method is less sensitive to outliers (because of the median), but it is much slower on larger datasets, since sorting is required at every iteration when computing the median vector.
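
The only change from K-means is the update step: each centre becomes the coordinate-wise median of its members, usually paired with Manhattan (L1) distance in the assignment step. A minimal sketch:

```python
import numpy as np

def kmedians(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign by Manhattan (L1) distance.
        dists = np.abs(X[:, None, :] - centers[None, :, :]).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update: coordinate-wise median, which resists outliers
        # better than the mean used by K-means.
        new_centers = np.array([
            np.median(X[labels == j], axis=0) if np.any(labels == j)
            else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Two small, well-separated groups of points.
X = np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0],
              [10.0, 10.0], [10.0, 10.1], [10.1, 10.0]])
labels, centers = kmedians(X, k=2)
print(labels)
```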

Uses of K-Means Clustering

The k-means method is used for partitioning observations into similar clusters, based on their description by a set of quantitative variables. K-means clustering has the following advantages:

• An object may be assigned to a class during one iteration and change class in the following iteration, which is not possible with Agglomerative Hierarchical Clustering, where an assignment cannot be reversed.
• By multiplying the starting points and repetitions, several solutions may be explored.

Clustering criteria for k-means clustering

Several clustering criteria may be used to reach a solution. XLSTAT offers four:

• Trace(W) / Median
• Determinant(W)
• Trace(W)
• Wilks' lambda

Results of k-means clustering in XLSTAT

• Optimization summary: a table showing the evolution of the within-class variance. If several repetitions have been requested, the results for each repetition are displayed.
• Statistics for each iteration: activate this option to see the evolution of various statistics computed as the iterations proceed for each repetition, together with the optimal result for the chosen criterion. If the corresponding option is activated in the Charts tab, a chart showing the evolution of the chosen criterion as the iterations proceed is displayed.
• Variance decomposition for the optimal classification: a table showing the within-class variance, the between-class variance, and the total variance.
• Class centroids: a table showing the class centroids for the various descriptors.
• Distances between class centroids: a table showing the Euclidean distances between the class centroids for the various descriptors.
• Central objects: a table showing the coordinates of the object nearest to the centroid of each class.
• Distances between the central objects: a table showing the Euclidean distances between the class central objects for the various descriptors.
• Results by class: the descriptive statistics for the classes (number of objects, sum of weights, within-class variance, minimum distance to the centroid, mean distance to the centroid, maximum distance to the centroid) are displayed in the first part of the table. The second part lists the objects.
• Results by object: a table showing the assigned class for every object, with the objects sorted.
Results for the BBC Sport Matrix
Summary statistics:

Variable   Obs.   With missing   Without missing   Min     Max     Mean    Std. dev.
7          9      0              9                 0.000   0.000   0.000   0.000
1          9      0              9                 0.000   0.000   0.000   0.000
3          9      0              9                 0.000   1.000   0.222   0.441
2          9      0              9                 0.000   0.000   0.000   0.000
4          9      0              9                 0.000   1.000   0.333   0.500
2          9      0              9                 0.000   1.000   0.333   0.500

Optimization summary:

Repetition   Iteration   Initial within-class variance   Final within-class variance   ln(Determinant(W))
1            1           0.750                           0.583                         -Inf
2            1           0.938                           0.375                         -Inf
3            1           0.708                           0.250                         -Inf
4            1           1.000                           0.333                         -Inf
5            1           0.458                           0.333                         -Inf
6            1           0.708                           0.375                         -Inf
7            1           0.667                           0.250                         -Inf
8            1           0.750                           0.375                         -Inf
9            1           1.000                           0.250                         -Inf
10           1           0.875                           0.250                         -Inf

Statistics for each iteration:

Iteration   Within-class variance   Trace(W)   ln(Determinant(W))   Wilks' Lambda
0           0.750                   3.000      -Inf                 0.000
1           0.583                   2.333      -Inf                 0.000

Variance decomposition for the optimal classification:

                  Absolute   Percent
Within-class      0.583      84.00%
Between-classes   0.111      16.00%
Total             0.694      100.00%

Initial class centroids:

Class   7       1       3       2       4       2
1       0.000   0.000   1.000   0.000   0.500   0.500
2       0.000   0.000   0.000   0.000   0.500   0.500
3       0.000   0.000   0.000   0.000   0.000   0.000
4       0.000   0.000   0.000   0.000   0.000   0.000
5       0.000   0.000   0.000   0.000   0.000   0.000

Class centroids:

Class   7       1       3       2       4       2       Sum of weights   Within-class variance
1       0.000   0.000   1.000   0.000   0.500   0.500   2.000            1.000
2       0.000   0.000   0.000   0.000   0.667   0.667   3.000            0.667
3       0.000   0.000   0.000   0.000   0.000   0.000   2.000            0.000
4       0.000   0.000   0.000   0.000   0.000   0.000   1.000            0.000
5       0.000   0.000   0.000   0.000   0.000   0.000   1.000            0.000

Distances between the class centroids:

      1       2       3       4       5
1     0       1.027   1.225   1.225   1.225
2     1.027   0       0.943   0.943   0.943
3     1.225   0.943   0       0.000   0.000
4     1.225   0.943   0.000   0       0.000
5     1.225   0.943   0.000   0.000   0

Central objects:

Class   7       1       3       2       4       2
1 (0)   0.000   0.000   1.000   0.000   0.000   0.000
2 (0)   0.000   0.000   0.000   0.000   1.000   1.000
3 (0)   0.000   0.000   0.000   0.000   0.000   0.000
4 (0)   0.000   0.000   0.000   0.000   0.000   0.000
5 (0)   0.000   0.000   0.000   0.000   0.000   0.000

Distances between the central objects:

        1 (0)   2 (0)   3 (0)   4 (0)   5 (0)
1 (0)   0       1.732   1.000   1.000   1.000
2 (0)   1.732   0       1.414   1.414   1.414
3 (0)   1.000   1.414   0       0.000   0.000
4 (0)   1.000   1.414   0.000   0       0.000
5 (0)   1.000   1.414   0.000   0.000   0

Results by class:

Class                          1       2       3       4       5
Objects                        2       3       2       1       1
Sum of weights                 2       3       2       1       1
Within-class variance          1.000   0.667   0.000   0.000   0.000
Minimum distance to centroid   0.707   0.471   0.000   0.000   0.000
Average distance to centroid   0.707   0.654   0.000   0.000   0.000
Maximum distance to centroid   0.707   0.745   0.000   0.000   0.000
Objects:                       0 0 0 0 0 1 0 0 0

Results by object:

Observation   Class   Distance to centroid
0             1       0.707
0             2       0.745
0             3       0.000
0             4       0.000
0             5       0.000
0             3       0.000
0             2       0.745
1             1       0.707
0             2       0.471
Results for the BBC Sport Classes
Summary statistics:

Variable   Obs.   With missing   Without missing   Min     Max     Mean    Std. dev.
0          5      0              5                 0.000   0.000   0.000   0.000

Optimization summary:

Repetition   Iteration   Initial within-class variance   Final within-class variance   ln(Determinant(W))
1            1           0.000                           0.000                         -Inf
2            1           0.000                           0.000                         -Inf
3            1           0.000                           0.000                         -Inf
4            1           0.000                           0.000                         -Inf
5            1           0.000                           0.000                         -Inf
6            1           0.000                           0.000                         -Inf
7            1           0.000                           0.000                         -Inf
8            1           0.000                           0.000                         -Inf
9            1           0.000                           0.000                         -Inf
10           1           0.000                           0.000                         -Inf

Statistics for each iteration:

Iteration   Within-class variance   Trace(W)   ln(Determinant(W))   Wilks' Lambda
0           0.000                   0.000      -Inf                 0.000
1           0.000                   0.000      -Inf                 0.000

Variance decomposition for the optimal classification:

                  Absolute   Percent
Within-class      0.000      0.00%
Between-classes   0.000      0.00%
Total             0.000      100.00%

Initial class centroids:

Class   0
1       0.000
2       0.000
3       0.000
4       0.000
5       0.000

Class centroids:

Class   0       Sum of weights   Within-class variance
1       0.000   1.000            0.000
2       0.000   1.000            0.000
3       0.000   1.000            0.000
4       0.000   1.000            0.000
5       0.000   1.000            0.000

Distances between the class centroids:

      1       2       3       4       5
1     0       0.000   0.000   0.000   0.000
2     0.000   0       0.000   0.000   0.000
3     0.000   0.000   0       0.000   0.000
4     0.000   0.000   0.000   0       0.000
5     0.000   0.000   0.000   0.000   0

Central objects:

Class   0
1 (0)   0.000
2 (0)   0.000
3 (0)   0.000
4 (0)   0.000
5 (0)   0.000

Distances between the central objects:

        1 (0)   2 (0)   3 (0)   4 (0)   5 (0)
1 (0)   0       0.000   0.000   0.000   0.000
2 (0)   0.000   0       0.000   0.000   0.000
3 (0)   0.000   0.000   0       0.000   0.000
4 (0)   0.000   0.000   0.000   0       0.000
5 (0)   0.000   0.000   0.000   0.000   0

Results by class:

Class                          1       2       3       4       5
Objects                        1       1       1       1       1
Sum of weights                 1       1       1       1       1
Within-class variance          0.000   0.000   0.000   0.000   0.000
Minimum distance to centroid   0.000   0.000   0.000   0.000   0.000
Average distance to centroid   0.000   0.000   0.000   0.000   0.000
Maximum distance to centroid   0.000   0.000   0.000   0.000   0.000
Objects:                       0 0 0 0 0

Results by object:

Observation   Class   Distance to centroid
0             1       0.000
0             2       0.000
0             3       0.000
0             4       0.000
0             5       0.000
Repeated K-Means Clustering

The results of repeating K-means are provided below.

For the BBC Sport Matrix

Summary statistics:

Variable   Obs.   With missing   Without missing   Min     Max     Mean    Std. dev.
0          9      0              9                 0.000   1.000   0.333   0.500
0          9      0              9                 0.000   1.000   0.111   0.333
0          9      0              9                 0.000   2.000   0.556   0.882
0          9      0              9                 0.000   2.000   0.667   0.866
0          9      0              9                 0.000   0.000   0.000   0.000
0          9      0              9                 0.000   0.000   0.000   0.000

Optimization summary:

Repetition   Iteration   Initial within-class variance   Final within-class variance   ln(Determinant(W))
1            1           1.958                           0.375                         -Inf
2            1           2.875                           0.300                         -Inf
3            1           2.583                           0.125                         -Inf
4            1           2.500                           0.833                         -Inf
5            1           1.688                           0.125                         -Inf
6            1           2.438                           0.500                         -Inf
7            1           2.792                           0.750                         -Inf
8            1           2.833                           0.125                         -Inf
9            1           2.500                           0.500                         -Inf
10           1           2.875                           0.125                         -Inf

Statistics for each iteration:

Iteration   Within-class variance   Trace(W)   ln(Determinant(W))   Wilks' Lambda
0           1.958                   7.833      -Inf                 0.000
1           0.375                   1.500      -Inf                 0.000

Variance decomposition for the optimal classification:

                  Absolute   Percent
Within-class      0.375      19.85%
Between-classes   1.514      80.15%
Total             1.889      100.00%

Initial class centroids:

Class   0       0       0       0       0       0
1       0.333   0.000   0.000   0.667   0.000   0.000
2       0.000   0.000   0.000   0.000   0.000   0.000
3       1.000   0.000   0.000   0.000   0.000   0.000
4       0.000   0.500   1.500   1.500   0.000   0.000
5       0.500   0.000   1.000   0.500   0.000   0.000

Class centroids:

Class   0       0       0       0       0       0       Sum of weights   Within-class variance
1       1.000   0.000   0.000   0.000   0.000   0.000   3.000            0.000
2       0.000   0.000   0.000   0.000   0.000   0.000   2.000            0.000
3       0.000   0.000   0.000   2.000   0.000   0.000   1.000            0.000
4       0.000   0.000   2.000   1.000   0.000   0.000   1.000            0.000
5       0.000   0.500   1.500   1.500   0.000   0.000   2.000            1.500

Distances between the class centroids:

      1       2       3       4       5
1     0       1.000   2.236   2.449   2.398
2     1.000   0       2.000   2.236   2.179
3     2.236   2.000   0       2.236   1.658
4     2.449   2.236   2.236   0       0.866
5     2.398   2.179   1.658   0.866   0

Central objects:

Class   0       0       0       0       0       0
1 (0)   1.000   0.000   0.000   0.000   0.000   0.000
2 (1)   0.000   0.000   0.000   0.000   0.000   0.000
3 (0)   0.000   0.000   0.000   2.000   0.000   0.000
4 (0)   0.000   0.000   2.000   1.000   0.000   0.000
5 (0)   0.000   0.000   1.000   1.000   0.000   0.000

Distances between the central objects:

        1 (0)   2 (1)   3 (0)   4 (0)   5 (0)
1 (0)   0       1.000   2.236   2.449   1.732
2 (1)   1.000   0       2.000   2.236   1.414
3 (0)   2.236   2.000   0       2.236   1.414
4 (0)   2.449   2.236   2.236   0       1.000
5 (0)   1.732   1.414   1.414   1.000   0

Results by class:

Class                          1       2       3       4       5
Objects                        3       2       1       1       2
Sum of weights                 3       2       1       1       2
Within-class variance          0.000   0.000   0.000   0.000   1.500
Minimum distance to centroid   0.000   0.000   0.000   0.000   0.866
Average distance to centroid   0.000   0.000   0.000   0.000   0.866
Maximum distance to centroid   0.000   0.000   0.000   0.000   0.866
Objects:                       0 1 0 0 0 0 0 1 0

Results by object:

Observation   Class   Distance to centroid
0             1       0.000
1             2       0.000
0             2       0.000
0             1       0.000
0             1       0.000
0             3       0.000
0             4       0.000
0             5       0.866
1             5       0.866
References

Aggarwal, C. C. and Reddy, C. K. (2016). Data Clustering: Algorithms and Applications. CRC Press.

Celebi, M. E. (2016). Partitional Clustering Algorithms. Springer International Publishing.

Kaushik, S. (2016). An Introduction to Clustering & different methods of clustering. [online] Analytics Vidhya. Available at: https://www.analyticsvidhya.com/blog/2016/11/an-introduction-to-clustering-and-different-methods-of-clustering/ [Accessed 24 Aug. 2018].

