Exploring Density-Based Subspace Clustering

Consider the density-based subspace clustering. The size of a subspace is defined to be the total number of dimensions for this subspace. For example, subspace {A, B} is of size 2. For each single dimension, the number of grid units is fixed to a constant c where c is a positive integer greater than 1.

(a) In class, we learnt that the major idea in the KL-transform is to transform the original coordinate system to a new coordinate system such that we could find clusters in subspaces from the new coordinate system. Suppose that we use the KL-transform to transform all data points from the original coordinate system to a new coordinate system (without using Step 6 of the KL-transform (i.e., choosing a subset of attribute values)). Then, all points are now represented in the new coordinate system. Based on the new coordinate system, we adopt the density-based subspace clustering to find clusters in some subspaces. Is it always true that the total number of grid units involved in all clusters based on the new coordinate system is smaller than that based on the original coordinate system? If yes, please give some justifications without any formal proof. If no, similarly, please give some counter examples for illustration.

(b) When the size of the subspace is larger, it is less likely that a grid unit with respect to the subspace is dense. Please explain it.

(c) In order to overcome the weakness described in (b), instead of setting a fixed density threshold for the subspace of any size, we use a smaller density threshold for the subspace of larger size. Specifically, let Ti be the density threshold for the subspace of size i. If i < j, then Ti > Tj. Let Condition 1 be “Ti > Tj for

any i < j”. Let Condition 2 be “for any i and j, Ti = Tj”. We know that if Condition 2 is satisfied, then the original Apriori-like algorithm studied in class can find all subspaces containing dense units.

(i) Under Condition 1, is it always true that we can still adopt the Apriori-like algorithm? If yes, please describe how to adopt the algorithm. Otherwise, please give reasons why it cannot be adopted.

(ii) Suppose that we modify Condition 1 to the following form. Let Condition 1 be “Ti = cTi+1 for each positive integer i”. Assume that we adopt this new form of Condition 1. Under this new form of Condition 1, is it always true that we can still adopt the Apriori-like algorithm? If yes, please describe how to adopt the algorithm. Otherwise, please give reasons why it cannot be adopted. 2/6

(a) Consider a set P containing the following four 2-dimensional data points. a:(6, 6), b:(8, 8), c:(5, 9), d:(9, 5) We can make use of the KL-Transform to find a transformed subspace containing a cluster. Let L be the total number of dimensions in the original space and K be the total number of dimensions in the projected subspace. Please illustrate the KL-transform technique with the above example when L=2 and K=1.

(b) Consider a set Q containing the following four 2-dimensional data points. e:(5, 5), f:(9, 9), g:(3, 11), h:(11, 3)

(i) Let p = (xp, yp) be a point in P and q = (xq, yq) be a point in Q. In fact, we could express xq in a linear form involving xp such that xq = α . xp + β where α and β are 2 real numbers. Similarly, we could express yq in the same linear form involving yp. Please write down the values of α and β.

(ii) Similar to Part (a), we want to make use of the KL-Transform to find a transformed subspace containing a cluster for the set Q where L = 2 and K = 1. One “straightforward” or “naïve” method is to use the same method in Part (a) to obtain the answer. Is it possible to make use of the result in Part (a) and the result in Part (b)(i) to obtain the answer very quickly? If yes, please explain briefly and give the answer. There is no need to give a formal proof. A brief description it accepted. If no, please give an explanation briefly. In this case, derive the answer by using the method in Part (a).

(c) Consider Part (a). It is independent of Part (b). In Part (a), we know that there are 4 points. Suppose that we have 4 additional points which are identical to the original 4 points. That is, we have the following 4 additional points. Totally, we have 8 data points. (6, 6), (8, 8), (5, 9), (9, 5) One “straightforward” or “naïve” method is to use the same method in Part (a) to obtain the answer. Is it possible to make use of the result in Part (a) to obtain the answer very quickly? If yes, please explain briefly and give the answer. There is no need to give a formal proof. A brief description it accepted. If no,please give an explanation briefly. In this case, derive the answer by using the method in Part (a).

Get instant help from 5000+ experts for