Data Mining Quiz

Question 1
1.Which of the following is the best example of a data mining application?
a. An MIT professor proved a mathematical theorem that can help predict stock prices.
b. The Internal Revenue Service used a software system, built based on the historical records, to identify fraudulent tax returns.
c. Using Google, a retired mailman found his long lost high-school sweetheart.
2 points
Question 2
1.Which of the following is an example of using the Naïve Rule for predicting the outcome of the next Patriots-Jets game?
a. Patriots have beaten Jets eight times in their last ten meetings. Based on this information, I predict Patriots will win again when they meet next time.
b. Patriots have beaten Jets eight times in their last ten meetings. It is Jets’ turn now. I will pick Jets to win when they meet next time.

c. I will flip a coin – head for Patriots, and tail for Jets.

d. All of the above are using the Naïve Rule.
2 points
Question 3
1.A data-mining technique is used to analyze the CongressVote data. The results of the analysis show that there are three groups: the first formed primarily by Democrats, the second primarily by Republicans, and the third by a mix of Democrats and Republicans. Members of each group share some common characteristics. This type of data mining is called __________
a. classification.
b. numeric prediction.
c. clustering.
d. association.
2 points
Question 4
1.A data-mining technique is used to analyze the CongressVote data. The results of the analysis show some interesting patterns about voting. For example, if a House member (Democrat or Republican) voted for the ‘anti-satellite-test-ban’ bill, s/he typically voted against the ‘mx-missile’ bill. Also, if a member voted for the ‘physician-fee-freeze’ bill, s/he often also voted for the ‘education-spending’ bill. This type of data mining is called _____
a. classification.
b. numeric prediction.
c. clustering.
d. association.
2 points
Question 5
1.In Weka, the 10-fold cross-validation is the default setting for performance evaluation. Given a dataset of 200 records, can you perform a 200-fold cross-validation in Weka?
a. Yes.
b. No.
2 points
Question 6
1.You bought some Microsoft stock last year. The price of the stock has increased. You want to use a data-mining technique to help decide whether you should hold your Microsoft stock or sell it for a profit. The data-mining task you are facing is _______
a. classification.
b. clustering.
c. association.
d. missing data estimation.
2 points
Question 7
1.Which of the following is NOT typically considered a big data analytics project?
a. A large online retail company (e.g., Amazon) attempts to identify potential buyers for certain products based on the customers’ purchase history and behavior, along with their online interactions with the customer service department of the company.
b. A group of researchers attempt to predict the U.S. population count in 2020, using survey data collected from a few representative states over the last few years.
c. A group of researchers in a political institute attempt to predict the outcome of the 2020 US presidential election, using economic and survey data, as well as social media data.
2 points
Question 8
1.Which of the following is NOT generally considered a big data model or technology?
a. MapReduce.
b. Hadoop.
c. NoSQL.
d. All of the above are generally considered big-data-oriented technologies.
2 points
Question 9
1.Suppose you have a dataset of 500 records. You would like to build a classification model and then evaluate the model using a 10-fold cross-validation procedure. How many records will be used for validation after you complete the 10-fold cross-validation procedure?
a. 50.
b. 100.
c. 500.
d. Unknown, because the records are randomly selected for validation.
2 points
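For reference, the fold mechanics behind this question can be sketched in a few lines of Python. This is a minimal illustration and not Weka's implementation: the helper name k_fold_indices is hypothetical, and it uses contiguous folds without the shuffling or stratification a real tool would normally apply.

```python
def k_fold_indices(n_records, k):
    """Split record indices 0..n_records-1 into k contiguous folds.

    Hypothetical helper for illustration only (no shuffling, no stratification).
    """
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_records // k + (1 if i < n_records % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


folds = k_fold_indices(500, 10)

# In round i, folds[i] is held out for validation and the other nine folds train the model.
validated = [index for fold in folds for index in fold]

print(len(folds), "folds of sizes", [len(fold) for fold in folds])
print(len(set(validated)), "distinct records held out across all rounds")
```

Each record lands in exactly one fold, so it is held out once over the full procedure.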
Question 10
1.We learned that in a medical/health application, a ‘positive’ class is considered more important and should have a higher misclassification cost than a ‘negative’ class. Based on this consideration, which of the following statements is correct?
a. True positive rate is more important than false negative rate.
b. True negative rate is more important than false positive rate.
c. False positive rate is more important than false negative rate.
d. False negative rate is more important than false positive rate.
2 points
Question 11
1.We learned that in a medical/health application, a ‘positive’ class is considered more important and should have a higher misclassification cost than a ‘negative’ class. Which of the following methods can be used to take this cost consideration into account?
a. Under-sampling for the records with a ‘negative’ class.
b. Over-sampling for the records with a ‘positive’ class.
c. Assigning a weight to each record based on the cost of corresponding class.
d. All of the above.
2 points
Question 12
1.The values of an attribute in a dataset are: -1, 0, 2, ?, 2, where a ? represents a missing value. With the mean-substitution method, which of the following should be used to replace the missing value?
a. 0
b. 2/4
c. 3/4
d. 1
2 points
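The mean-substitution step can be sketched as follows. This is a minimal illustration that assumes the missing value is represented as None and that the mean is taken over the observed values only; the variable names are illustrative.

```python
# Attribute values from the question; None stands in for the missing "?".
values = [-1, 0, 2, None, 2]

# Mean substitution averages only the values that are actually observed.
observed = [v for v in values if v is not None]
mean_observed = sum(observed) / len(observed)

# Replace each missing entry with that mean.
imputed = [mean_observed if v is None else v for v in values]

print(mean_observed)
print(imputed)
```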
Question 13
1.An attribute called Profit in a small dataset has a total of five values: 200, -100, -200, 100, and 300. After normalization, the value of the second item (-100) is transformed to ____.
a. -0.1
b. 0.1
c. 0.2
d. 0.3
2 points
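The question does not say which normalization is meant. The sketch below assumes min-max normalization to the range [0, 1], that is (x - min) / (max - min); other schemes such as z-score standardization would give different numbers.

```python
# Profit values from the question.
profits = [200, -100, -200, 100, 300]

# Assumption: min-max normalization, which rescales the smallest value to 0
# and the largest value to 1.
lowest, highest = min(profits), max(profits)
normalized = [(p - lowest) / (highest - lowest) for p in profits]

for original, scaled in zip(profits, normalized):
    print(original, "->", round(scaled, 3))
```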
Question 14
1.Which of the following statements about a decision tree is correct?
a. The smaller the tree size, the smaller the training error.
b. The larger the tree size, the smaller the training error.
c. The smaller the tree size, the smaller the validation error.
d. The larger the tree size, the smaller the validation error.
2 points
Question 15
1.This question is related to the table below. What is the (non-normalized) Euclidean distance between Lisa and Mike, calculated based on Age and Income?
        Age   Income (in $1,000s)
Lisa    39    44
Mike    45    52
a. 8
b. 10
c. 12
d. 14
2 points
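The non-normalized Euclidean distance is the square root of the sum of squared attribute differences. A minimal sketch using the Age and Income values from the table (the tuple layout is an illustrative choice):

```python
import math

# (Age, Income in $1,000s) from the table in the question.
lisa = (39, 44)
mike = (45, 52)

# Sum the squared differences attribute by attribute, then take the square root.
squared_differences = [(a - b) ** 2 for a, b in zip(lisa, mike)]
distance = math.sqrt(sum(squared_differences))

print(squared_differences, distance)
```

Because no normalization is applied, both attributes contribute in their raw units.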
Question 16
1.When you look at a book from Amazon.com, you will often see a list of related books recommended by Amazon. Which of the following algorithms is most likely used by Amazon to get the recommended books?
a. Apriori algorithm.
b. CART algorithm.
c. Single linkage algorithm.
d. k-means algorithm.
2 points
Question 17
1.Which of the following statements is FALSE?
a. K-means clustering requires the number of clusters be specified before computation starts.
b. Hierarchical clustering (e.g., single linkage) does not require the number of clusters be specified before the hierarchy is computed.
c. In the k-means clustering computation, if a data point is assigned to a cluster, it will not be reassigned to another cluster.
d. In the hierarchical clustering computation, if a data point is assigned to a cluster, it will not be reassigned to another cluster.
2 points
Question 18
1.Which of the following statements about clustering error rate is correct?
a. The clustering error rate is the same as the classification error rate.
b. The clustering error rate is the same as the false positive rate.
c. The clustering error rate is the same as the false negative rate.
d. There is not a generally defined clustering error rate.
2 points
Question 19
1.If the “diapers and beer” story (which reveals the real motivation for men to buy diapers and beer) is true, then which of the following is most likely to occur?
a. confidence (men buy diapers => men buy beer) is the same as confidence (men buy beer => men buy diapers).
b. confidence (men buy diapers => men buy beer) is greater than confidence (men buy beer => men buy diapers).
c. confidence (men buy diapers => men buy beer) is smaller than confidence (men buy beer => men buy diapers).
2 points
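Rule confidence is directional: confidence(A => B) is the number of transactions containing both A and B divided by the number containing A. The sketch below shows this on a made-up set of six transactions; the data is hypothetical and chosen only to show that the two directions can differ.

```python
# Hypothetical market baskets, invented purely for illustration.
transactions = [
    {"diapers", "beer"},
    {"diapers", "beer"},
    {"beer"},
    {"beer", "chips"},
    {"diapers"},
    {"beer"},
]


def confidence(antecedent, consequent):
    """confidence(antecedent => consequent) = count(both) / count(antecedent)."""
    with_antecedent = [t for t in transactions if antecedent in t]
    with_both = [t for t in with_antecedent if consequent in t]
    return len(with_both) / len(with_antecedent)


diapers_to_beer = confidence("diapers", "beer")
beer_to_diapers = confidence("beer", "diapers")
print(diapers_to_beer, beer_to_diapers)
```

The numerator is the same in both directions; only the denominator changes, so the rule whose antecedent is the rarer item has the higher confidence.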
Question 20
1.Which of the following statements about association rule ‘{X, Y} => Z’ (where X, Y, and Z are itemsets) is true?
a. ‘{X, Y} => Z’ implies ‘X causes Z’.
b. ‘{X, Y} => Z’ implies ‘X or Y causes Z’.
c. ‘{X, Y} => Z’ implies ‘X and Y cause Z’.
d. None of the above is true.
2 points
Question 21
1.Which of the following words is least likely to be a stop word?
a. stop
b. such
c. that
d. then
2 points
Question 22
1.Given a set of documents, a large tfidf(t, d) value for term t and document d suggests that
a. term t appears frequently in most documents.
b. term t appears rarely in any document.
c. term t appears frequently in document d but rarely in most of the remaining documents.
d. term t appears rarely in document d but frequently in most of the remaining documents.
2 points
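Several tf-idf variants exist. The sketch below assumes one common form, tfidf(t, d) = tf(t, d) * ln(N / df(t)), where tf is the raw count of t in d, N is the number of documents, and df(t) is the number of documents containing t. The four short documents are hypothetical.

```python
import math

# Hypothetical tokenized documents, invented for illustration.
docs = [
    "refund refund refund slow".split(),
    "great product".split(),
    "great service".split(),
    "slow service great".split(),
]


def tfidf(term, doc_index):
    """Raw term count times ln(N / document frequency); one common variant."""
    tf = docs[doc_index].count(term)
    df = sum(1 for doc in docs if term in doc)
    if df == 0:
        return 0.0
    return tf * math.log(len(docs) / df)


# "refund" is frequent in document 0 and absent elsewhere;
# "great" appears once in three of the four documents.
print(tfidf("refund", 0))
print(tfidf("great", 1))
```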
Question 23
1.Which of the following data-mining problems is most similar to a regression problem?
a. Association rules
b. Classification
c. Clustering
2 points
Question 24
1.Which of the following is true about support vector machines (SVM)?
a. SVM classification models are sensitive to outliers in the training dataset.
b. SVM classification models are sensitive to data points near the decision boundary.
c. Both of the above are true.
2 points
Question 25
1.Which of the following is true about linear regression (LR) and support vector regression (SVR)?
a. LR almost always fits training data better than SVR.
b. LR almost always fits testing data better than SVR.
c. SVR almost always fits training data better than LR.
d. SVR almost always fits testing data better than LR.
2 points
Question 26
1.The following are the Weka decision tree outputs based on the BostonHousing2 data used in class. Answer questions (a), (b), (c) and (d) following the output screens.


a.How many rules can be derived from the above decision tree model?
b.If you increase the minimum number of records required in a leaf, will the resulting tree generally be larger or smaller than the current one? Why? Explain it in one sentence.
c.Let’s call the tree shown in the output screens Tree 1. The confusion matrix based on Tree 1 is shown on the bottom of the first screen. The confusion matrix shown after question (d) below is obtained using another tree called Tree 2. Based on the results in the two confusion matrices, which tree is better? Show your calculations.
d.A real estate agent who focuses on low value homes is interested in using decision trees to classify the values of homes in this area. He considers misclassifying a ‘low’ value home as ‘high’ to be a more costly error, because it will be harder for him to sell an actual ‘low’ value home if he presents the ‘false high’ results to the seller. Therefore, he assigns a cost of 0.9 to such an error (and thus 0.1 for the cost of misclassifying a ‘high’ value home as ‘low’). With this set of costs, which tree model is better for him? Show your calculations.
                 Classified as 'low'   Classified as 'high'
Actual 'low'     414                   8
Actual 'high'    32                    52

12 points
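The calculations asked for in (c) and (d) can be sketched for the one confusion matrix reproduced in the text (Tree 2). Tree 1's matrix appears only in the output screens, so it is not included here; the same steps would be repeated with its counts. The dictionary layout is an illustrative choice, and the costs are the 0.9 and 0.1 stated in (d).

```python
# Tree 2 confusion matrix from the text, keyed by (actual class, predicted class).
tree2 = {
    ("low", "low"): 414,
    ("low", "high"): 8,
    ("high", "low"): 32,
    ("high", "high"): 52,
}

# Misclassification costs from question (d).
costs = {("low", "high"): 0.9, ("high", "low"): 0.1}

total = sum(tree2.values())
errors = sum(count for (actual, predicted), count in tree2.items() if actual != predicted)
error_rate = errors / total

# Weighted cost: each kind of error is multiplied by its cost and summed.
total_cost = sum(costs[cell] * count for cell, count in tree2.items() if cell in costs)

print(total, errors, round(error_rate, 4), round(total_cost, 2))
```

Comparing error_rate across the two trees addresses (c), and comparing total_cost addresses (d).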
Question 27
1.Consider the TopUniversities dataset used in class, which has 25 records. The dendrogram below is generated by the single linkage algorithm in Weka based on the dataset. The subsequent two screens show the k-means clustering results using Weka, where k = 4 and the clusters are labeled from cluster 0 to cluster 3. Answer questions (a), (b) and (c) following the screens.

a.Suppose we would like to obtain 4 clusters using the dendrogram. How many records are contained in each of the 4 clusters?
b.Based on the k-Means results, which cluster has the lowest expenses? Which universities belong to this cluster? Note that the vertical axis of the second screen represents the cluster ID (cluster 0, cluster 1,…).
c.Which method, single linkage (with dendrogram) or k-means, is better for clustering this dataset? Explain the reason in one or two sentences.
12 points
Question 28
1.Consider again the TopUniversities data used in class. In addition to the existing attributes, U.S. News & World Report also provided rankings for the 25 universities. The rank order is the same as the position of the university in the dataset, e.g., Harvard is ranked #1, Princeton #2, …, and Texas A&M #25 (see the list on the horizontal axis in the last screen in Question 29). The first output screen below is generated by Weka’s SVR algorithm (SMOreg), using the university’s rank as the target attribute. Then, we replaced the numeric ranking attribute with a 2-class attribute by grouping the first 15 universities to class A and the remaining 10 universities to class B. Based on this grouped dataset, the second output screen is generated by Weka’s SVM algorithm (SMO) and the third output screen is generated by Weka’s decision tree algorithm (J48). Answer questions (a), (b), (c) and (d) following the output screens.

a.Based on the SVR model (first screen), what two attributes are the most important predictors?
b.Why is the coefficient of the AvgSAT attribute a negative number?
c.Based on the SVM model (second screen), what two attributes are the most important predictors?
d.Based on the decision tree model (third screen), what two attributes are the most important predictors?

Question 29

1.You are performing text mining on a customer review dataset containing 200 customer reviews. Answer the following questions:
a.Suppose each review was limited to no more than 50 words. In the term-document matrix, which dimension is more likely to be larger, the number of documents or the number of terms? Explain your choice in one sentence.
b.You are considering using stemming or lemmatization to process the review text. The term ‘increasing’ appeared in many reviews. What are the results of stemming and lemmatization of this term, respectively?
c.In addition to the review text data, each customer also provided a rating score, with 1-star representing poor and 5-star representing excellent. Suppose your text mining task is to predict ratings based on the customer reviews. Which of the three techniques below is NOT appropriate for your task? Choose only one answer.
(i) J48 decision tree algorithm
(ii) support vector regression
(iii) k-means algorithm
6 points
