Clustering Public Utilities Data using Python Scikit-learn

CS 4435 Data Mining

Task:

Utilities.csv gives corporate data on 22 public utilities in the United States. We are interested in forming groups of similar utilities. The records to be clustered are the utilities, and the clustering will be based on the eight measurements of each utility. The features are:

Fixed = fixed-charge covering ratio (income/debt)
RoR = rate of return on capital
Cost = cost per kilowatt capacity in place
Load = annual load factor
Demand = peak kilowatt-hour demand growth from 1974 to 1975
Sales = sales (kilowatt-hour use per year)
Nuclear = percent nuclear
Fuel Cost = total fuel costs (cents per kilowatt-hour)

Load the data as a pandas DataFrame and set the row names (index) to the utilities column (company). Convert all columns to float.
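The loading step might look like the sketch below. The column name "Company", the underscore in "Fuel_Cost", and every value in the inline text are placeholders invented so the example runs without the real file; check the actual header of Utilities.csv before reusing it.

```python
import io

import pandas as pd

# Placeholder rows standing in for Utilities.csv (values are invented,
# and the header names are assumptions about the real file).
csv_text = """Company,Fixed,RoR,Cost,Load,Demand,Sales,Nuclear,Fuel_Cost
Utility A,1.0,9,150,55,1,9000,0,0.6
Utility B,1.2,11,200,57,2,5000,25,1.5
Utility C,1.4,12,110,60,9,9500,0,0.9
"""

# For the real data, pass "Utilities.csv" instead of io.StringIO(csv_text).
df = pd.read_csv(io.StringIO(csv_text))

# Use the company name as the row index, then cast every feature to float.
df = df.set_index("Company").astype(float)

print(df.dtypes)
```

Setting the index before calling astype(float) keeps the company names out of the numeric conversion.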
1) a. Use “from sklearn.metrics import pairwise” to calculate the Euclidean distance between each pair of utilities, and show the distance matrix.
b. Standardize the features (subtract each column's mean and divide by its standard deviation), then recalculate the pairwise Euclidean distance matrix.
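A minimal sketch of both distance matrices follows. It uses random stand-in data with three hypothetical columns on very different scales rather than the real utilities, and it assumes the sample standard deviation (pandas' default, ddof=1); sklearn's StandardScaler divides by the population standard deviation instead, so its results differ slightly.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import pairwise

# Stand-in data: 5 rows, 3 features on very different scales (invented).
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.normal(size=(5, 3)) * [1, 10, 1000],
    index=[f"U{i}" for i in range(5)],
    columns=["Fixed", "Cost", "Sales"],
)

# Distance matrix on the raw features, labeled with the row names.
d_raw = pd.DataFrame(
    pairwise.pairwise_distances(df, metric="euclidean"),
    index=df.index,
    columns=df.index,
)

# z-score standardization: subtract the column mean, divide by the column std.
df_norm = (df - df.mean()) / df.std()

# Distance matrix on the standardized features.
d_norm = pd.DataFrame(
    pairwise.pairwise_distances(df_norm, metric="euclidean"),
    index=df.index,
    columns=df.index,
)

print(d_raw.round(2))
print(d_norm.round(2))
```

On the raw data the large-scale column dominates every distance; after standardization each feature contributes on a comparable scale.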

For the rest of the tasks, use the standardized (normalized) version of the DataFrame.

2) a. Use “from scipy.cluster.hierarchy import linkage” and plot the dendrogram using single linkage.

b. Plot the dendrogram using average linkage.

c. Use “from scipy.cluster.hierarchy import fcluster” and apply it to the single-linkage and average-linkage results to separate the data points into 6 clusters. Print the clusters with their corresponding members. (Set criterion='maxclust' for fcluster.)
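The hierarchical steps could be sketched as below, again on random stand-in data with hypothetical row names. The dendrogram call uses no_plot=True so the example runs without a plotting backend; removing that argument (with matplotlib installed) draws the tree on the current axes.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Stand-in for the standardized feature matrix (invented values).
rng = np.random.default_rng(0)
names = [f"U{i}" for i in range(12)]
X = rng.normal(size=(12, 4))

clusters = {}
for method in ("single", "average"):
    # Build the merge tree with Euclidean distances.
    Z = linkage(X, method=method, metric="euclidean")

    # no_plot=True only computes the layout; drop it to draw the dendrogram.
    tree = dendrogram(Z, labels=names, no_plot=True)

    # Cut the tree so that at most 6 flat clusters remain.
    labels = fcluster(Z, t=6, criterion="maxclust")

    # Group row names by cluster label.
    members = {}
    for name, lab in zip(names, labels):
        members.setdefault(int(lab), []).append(name)
    clusters[method] = members

    print(method)
    for lab in sorted(members):
        print(" ", lab, members[lab])
```

Note that fcluster takes the linkage matrix Z, not the dendrogram itself, and that its labels start at 1.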

3) a. Use “from sklearn.cluster import KMeans” to cluster the data into 6 clusters. Set random_state=0 for KMeans. Print the clusters and their members.
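A possible shape for the k-means step, on random stand-in data with hypothetical row names, is shown here. Passing n_init=10 is an assumption that is not part of the task; it pins the number of restarts so the result does not depend on the default of the installed scikit-learn version.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for the standardized 22 x 8 feature matrix (invented values).
rng = np.random.default_rng(0)
names = [f"U{i}" for i in range(22)]
X = rng.normal(size=(22, 8))

# n_init=10 is an assumption; random_state=0 follows the task.
km = KMeans(n_clusters=6, random_state=0, n_init=10).fit(X)

# Group row names by the cluster label assigned to each row.
members = {}
for name, lab in zip(names, km.labels_):
    members.setdefault(int(lab), []).append(name)

for lab in sorted(members):
    print(lab, members[lab])
```

Unlike fcluster, KMeans numbers its clusters from 0.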

b. For the number of clusters from 1 to 7, plot the average SSE against the number of clusters as a line plot. Use the “inertia_” attribute of the fitted KMeans object to get the SSE, and divide it by the number of clusters to get the average SSE.
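The average-SSE curve could be computed along these lines, using random stand-in data. As in the previous step, n_init=10 is an assumption rather than part of the task. The values are printed instead of plotted so the sketch runs without matplotlib; a line plot of ks against avg_sse gives the requested figure.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for the standardized 22 x 8 feature matrix (invented values).
rng = np.random.default_rng(0)
X = rng.normal(size=(22, 8))

ks = list(range(1, 8))
avg_sse = []
for k in ks:
    km = KMeans(n_clusters=k, random_state=0, n_init=10).fit(X)
    # inertia_ is the total within-cluster sum of squared distances (SSE);
    # dividing by k gives the average SSE per cluster, as the task asks.
    avg_sse.append(km.inertia_ / k)

for k, v in zip(ks, avg_sse):
    print(k, round(v, 3))

# With matplotlib available, the line plot would be:
#   import matplotlib.pyplot as plt
#   plt.plot(ks, avg_sse, marker="o")
#   plt.xlabel("Number of clusters")
#   plt.ylabel("Average SSE")
#   plt.show()
```

For k = 1 the SSE equals the total squared deviation of all rows from the overall mean, which is a convenient sanity check.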
