Question 1. Consider the following data set containing the age and serum creatine (sc) levels for a set of people: (16 points)
age  23    23    27    27    39    41    47    49    50    52    54    54    56    57    58    58    60    61
sc   9.5   26.5  7.8   17.8  31.4  25.9  27.4  27.2  31.2  34.6  42.5  28.8  33.4  30.2  34.1  32.9  41.2  35.7
a) Calculate the mean, median and standard deviation of age and creatine level. (6 points)
b) Draw a scatter plot of these two variables. (4 points)
c) Normalize the two variables based on the z-score normalization technique. (6 points)
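The quantities in parts (a) and (c) can be sketched with Python's standard `statistics` module. This is only an illustration, not a model answer: the helper names `describe` and `z_scores` are hypothetical, and the use of the population standard deviation (dividing by n rather than n - 1) is an assumption, since the question does not say which one is intended.

```python
import statistics

age = [23, 23, 27, 27, 39, 41, 47, 49, 50, 52, 54, 54, 56, 57, 58, 58, 60, 61]
sc = [9.5, 26.5, 7.8, 17.8, 31.4, 25.9, 27.4, 27.2, 31.2,
      34.6, 42.5, 28.8, 33.4, 30.2, 34.1, 32.9, 41.2, 35.7]


def describe(values):
    """Return (mean, median, population standard deviation)."""
    return (
        statistics.fmean(values),
        statistics.median(values),
        statistics.pstdev(values),  # assumption: population SD (divide by n)
    )


def z_scores(values):
    """z-score normalization: z = (v - mean) / sd for every value."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # swap in statistics.stdev for the sample SD
    return [(v - mean) / sd for v in values]


age_stats = describe(age)
sc_stats = describe(sc)
age_z = z_scores(age)
sc_z = z_scores(sc)

print("age mean/median/sd:", age_stats)
print("sc  mean/median/sd:", sc_stats)
```

After normalization each variable has mean 0 and standard deviation 1, which makes a quick sanity check on the result.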
Question 2. Consider data for analysis that includes the attribute length whose recorded values are: 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70. (10 points)
 a) Use smoothing by bin means to smooth the above data, using a bin depth of 3. Illustrate your steps. (8 points)
 b) Discuss how you would determine if there are outliers in the data. (2 points)
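A minimal sketch of both parts, under stated assumptions: part (a) is read as equal-depth (equal-frequency) binning on the sorted values, and part (b) uses the 1.5 × IQR rule, which is one common outlier heuristic among several (clustering or a box plot would be alternatives). The function names `smooth_by_bin_means` and `iqr_outliers` are hypothetical.

```python
import statistics

length = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
          33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]


def smooth_by_bin_means(values, depth):
    """Sort, split into bins of `depth` values, replace each value by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), depth):
        bin_values = ordered[start:start + depth]
        smoothed.extend([statistics.fmean(bin_values)] * len(bin_values))
    return smoothed


def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed heuristic)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


smoothed = smooth_by_bin_means(length, depth=3)
outliers = iqr_outliers(length)

print(smoothed)
print(outliers)
```

With 27 values and a depth of 3 the data splits evenly into 9 bins; the quartile values depend on the interpolation method, so other tools may place the fences slightly differently.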
Question 3. Consider the data set D = {(1.5, 1.7), (2, 1.9), (1.6, 1.8), (1.2, 1.5), (1.5, 1.0)}, where each element is a two-dimensional point in Euclidean space. (18 points)
 a) Given a new data point, (1.4, 1.6), rank/order the points in D based on their similarity to the new point using as the similarity measure: (8 points)
   • a.1. the Euclidean distance
   • a.2. the cosine similarity
 b) Normalize the data set, including the point (1.4, 1.6), so that the Euclidean norm of each data point equals 1. Rank/order the normalized points based on their similarity to the normalized (1.4, 1.6), using the Euclidean distance as the similarity measure. (10 points)
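The rankings asked for in parts (a) and (b) can be sketched as follows. The helper names `cosine` and `unit` and the variable names are hypothetical; smaller Euclidean distance and larger cosine similarity are both taken to mean "more similar".

```python
import math

D = [(1.5, 1.7), (2, 1.9), (1.6, 1.8), (1.2, 1.5), (1.5, 1.0)]
x = (1.4, 1.6)


def cosine(p, q):
    """Cosine similarity: dot(p, q) / (|p| * |q|)."""
    dot = sum(a * b for a, b in zip(p, q))
    return dot / (math.hypot(*p) * math.hypot(*q))


def unit(p):
    """Scale p so that its Euclidean norm is 1."""
    norm = math.hypot(*p)
    return tuple(a / norm for a in p)


# a.1: ascending Euclidean distance (closest first)
by_distance = sorted(D, key=lambda p: math.dist(p, x))

# a.2: descending cosine similarity (most similar first)
by_cosine = sorted(D, key=lambda p: cosine(p, x), reverse=True)

# b: normalize every point to unit length, then rank by Euclidean distance
x_unit = unit(x)
by_unit_distance = sorted(D, key=lambda p: math.dist(unit(p), x_unit))

print(by_distance)
print(by_cosine)
print(by_unit_distance)
```

For unit-length vectors u and v, the squared Euclidean distance equals 2 - 2·cos(u, v), so the ranking in part (b) coincides with the cosine ranking in part (a.2).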
Question 4. Design an algorithm, described in pseudocode, for the automatic generation of a concept hierarchy for categorical data based on the number of distinct values of the attributes in a given schema. Describe how an arbitrary schema would be represented in your framework and how the algorithm would derive the concept hierarchy from it. (6 points)
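One possible shape for such an algorithm, sketched in Python rather than pseudocode. Everything here is an assumption made for illustration: the schema is represented as a dictionary mapping each attribute name to its column of values, the heuristic places the attribute with the fewest distinct values at the top of the hierarchy, and the function name `generate_hierarchy` and the sample location data are hypothetical.

```python
def generate_hierarchy(table):
    """Order attributes from most general (fewest distinct values) to most specific.

    `table` maps attribute name -> list of values (assumed schema representation).
    """
    distinct_counts = {attr: len(set(values)) for attr, values in table.items()}
    # sorted() is stable, so attributes with equal counts keep their schema order
    return sorted(distinct_counts, key=distinct_counts.get)


# Hypothetical sample data, for illustration only
location = {
    "street": ["Main St", "Oak St", "Fort St", "King St", "Queen St"],
    "city": ["Vancouver", "Vancouver", "Victoria", "Toronto", "Toronto"],
    "province": ["BC", "BC", "BC", "Ontario", "Ontario"],
    "country": ["Canada", "Canada", "Canada", "Canada", "Canada"],
}

hierarchy = generate_hierarchy(location)
print(" > ".join(hierarchy))
```

The distinct-value heuristic is not foolproof: a time schema, for instance, can have fewer distinct weekdays than months without weekday belonging above month, so the generated ordering may still need manual review.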
Question 5. Consider the following data set containing information about participants in an online test. The dataset contains, for each participant, their age and the number of minutes it took them to complete the test: (12 points)
age                  20  24  32  38  44  46  47  49  50
test duration (min)  27  30  29  23  25  25  30  34  32
(a) Calculate the median and standard deviation of the test duration variable. (6 points)
(b) Normalize the test duration variable based on the z-score normalization technique. (6 points)
Question 6. Consider the following data set: 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46. Use smoothing by bin medians to smooth the above data, using a bin depth of 5. Illustrate your steps. (10 points)
Question 7. What is Cluster Analysis?
Question 8. What is data?
• How is data structured?
• How can basic statistical descriptions be used to study or infer data characteristics?
Question 9. When should data pre-processing be done?