Fundamentals of Data Analytics: Techniques and Association Rules Mining


Data Analytics / Fundamentals of Data Analytics
Student Name:
Date: 1st April 2021

1. BACKGROUND

The purpose of this assignment is to apply data distance techniques and Association Rules Mining, using the Apriori and FP-Growth algorithms to find frequent itemsets.

2. QUESTIONS

Question 1: (20 points)

Given two objects represented by two tuples:
Tuple 1 is your date of birth in the format (first digit of year, last digit of year, MM, DD). For example, if you were born on 2000/03/23, the tuple is (2, 0, 3, 23).
Tuple 2 is built from your student ID number (2nd digit, 7th digit, 8th digit, 9th digit). For example, if your ID is 359000403, the tuple is (5, 4, 0, 3).

A) Compute the Euclidean distance between the two tuples. (10 points)

Solution
Using the example tuples (2, 0, 3, 23) and (5, 4, 0, 3):
d_Euclidean = sqrt((2 - 5)^2 + (0 - 4)^2 + (3 - 0)^2 + (23 - 3)^2)
            = sqrt(9 + 16 + 9 + 400)
            = sqrt(434)
            ≈ 20.83

B) Compute the Manhattan distance between the two tuples. (10 points)

Solution
d_Manhattan = |2 - 5| + |0 - 4| + |3 - 0| + |23 - 3|
            = 3 + 4 + 3 + 20
            = 30

Question 2: (20 points)

Using the Apriori algorithm, find the itemsets with two or more items that have a minimum support of 50%.

Transaction Table
Transaction Id   Itemsets
T1               Apple, Lemon, Pineapple
T2               Orange, Apple, Mango, Tomato
T3               Apple, Lemon, Tomato, Cucumber
T4               Apple, Tomato, Pen, Orange

Implement the Apriori algorithm and present all the steps in your answer.

Solution
A minimum support of 50% corresponds to a support count of 2 out of 4 transactions.

Step 1 (k = 1): scan the transactions and count each item.
Item             Support count
Apple (I1)       4
Lemon (I2)       2
Pineapple (I3)   1
Orange (I4)      2
Mango (I5)       1
Tomato (I6)      3
Cucumber (I7)    1
Pen (I8)         1

L1 keeps the items with support count >= 2: {Apple}, {Lemon}, {Orange}, {Tomato}.

Step 2 (k = 2): generate the candidates C2 by joining L1 with L1 (the join step requires the first k - 2 elements to match). Every 1-item subset of these candidates is frequent, so all six pairs are counted.
Itemset            Support count
Apple, Lemon       2
Apple, Orange      2
Apple, Tomato      3
Lemon, Orange      0
Lemon, Tomato      1
Orange, Tomato     2

L2 = {Apple, Lemon}, {Apple, Orange}, {Apple, Tomato}, {Orange, Tomato}.

Step 3 (k = 3): joining L2 gives the candidates {Apple, Lemon, Orange}, {Apple, Lemon, Tomato} and {Apple, Orange, Tomato}. The first two are pruned because they contain the infrequent subsets {Lemon, Orange} and {Lemon, Tomato}; only {Apple, Orange, Tomato} survives, with a support count of 2 (T2 and T4), so L3 = {Apple, Orange, Tomato}. No candidate of size 4 can be generated, so the algorithm stops.

The frequent itemsets with two or more items at 50% minimum support are therefore {Apple, Lemon}, {Apple, Orange}, {Apple, Tomato}, {Orange, Tomato} and {Apple, Orange, Tomato}. The strongest of these is {Apple, Tomato}, with support 3/4 = 75%.

Question 3: (10 points)

Using the transaction table:
Transaction Id   Itemsets
T1               Desktop, Mouse, Keyboard, Monitor
T2               Laptop, Keyboard
T3               Keyboard, Mouse, Monitor
T4               Desktop, Monitor
T5               Laptop, Keyboard, Mouse

From the above table, what are the support, confidence and lift of the following association rules:

1) Keyboard -> Mouse (2 points)
Solution
Support    = count(Keyboard, Mouse) / 5 = 3/5 = 0.60
Confidence = count(Keyboard, Mouse) / count(Keyboard) = 3/4 = 0.75
Lift       = Confidence / support(Mouse) = 0.75 / (3/5) = 1.25

2) Laptop -> Keyboard (3 points)
Solution
Support    = count(Laptop, Keyboard) / 5 = 2/5 = 0.40
Confidence = count(Laptop, Keyboard) / count(Laptop) = 2/2 = 1.00
Lift       = 1.00 / (4/5) = 1.25

3) Desktop -> Laptop (3 points)
Solution
Support    = count(Desktop, Laptop) / 5 = 0/5 = 0
Confidence = 0 / count(Desktop) = 0/2 = 0
Lift       = 0

4) Laptop -> Mouse (2 points)
Solution
Support    = count(Laptop, Mouse) / 5 = 1/5 = 0.20
Confidence = count(Laptop, Mouse) / count(Laptop) = 1/2 = 0.50
Lift       = 0.50 / (3/5) ≈ 0.83

Based on the results, the three metrics mean the following for each rule. Support measures how popular an itemset is: {Keyboard, Mouse} is the most popular combination (60% of transactions), while Desktop and Laptop never appear together. Confidence measures how often the consequent is bought when the antecedent is bought: Laptop -> Keyboard has confidence 1, so every laptop purchase also includes a keyboard. Lift compares the rule against independence: values above 1 (Keyboard -> Mouse and Laptop -> Keyboard, both 1.25) indicate a positive association, a value below 1 (Laptop -> Mouse, 0.83) indicates a slight negative association, and a lift of 0 (Desktop -> Laptop) means the items are never purchased together.
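The calculations in Questions 1 to 3 can be double-checked with a short Python sketch. It is only an illustration, not the Apriori implementation the assignment asks for: the frequent itemsets are found by a brute-force subset count rather than candidate generation, and the helper name rule_metrics is introduced here for convenience. The tuple values and transaction lists are the example data given above.

import math
from itertools import combinations

# Question 1: distances between the example tuples
t1 = (2, 0, 3, 23)   # date of birth (first digit of year, last digit of year, MM, DD)
t2 = (5, 4, 0, 3)    # student ID digits (2nd, 7th, 8th, 9th)
euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(t1, t2)))
manhattan = sum(abs(a - b) for a, b in zip(t1, t2))
print(euclidean, manhattan)   # about 20.83 and 30

# Question 2: brute-force check of the frequent itemsets (support count >= 2 of 4)
transactions = [
    {"Apple", "Lemon", "Pineapple"},
    {"Orange", "Apple", "Mango", "Tomato"},
    {"Apple", "Lemon", "Tomato", "Cucumber"},
    {"Apple", "Tomato", "Pen", "Orange"},
]
items = sorted(set().union(*transactions))
min_count = 2
for k in range(2, len(items) + 1):
    for candidate in combinations(items, k):
        count = sum(set(candidate) <= t for t in transactions)
        if count >= min_count:
            print(candidate, count)   # should list the five itemsets found above

# Question 3: support, confidence and lift of a rule X -> Y
def rule_metrics(transactions, x, y):
    n = len(transactions)
    both = sum(x <= t and y <= t for t in transactions)
    antecedent = sum(x <= t for t in transactions)
    consequent = sum(y <= t for t in transactions)
    support = both / n
    confidence = both / antecedent if antecedent else 0.0
    lift = confidence / (consequent / n) if consequent else 0.0
    return support, confidence, lift

q3 = [
    {"Desktop", "Mouse", "Keyboard", "Monitor"},
    {"Laptop", "Keyboard"},
    {"Keyboard", "Mouse", "Monitor"},
    {"Desktop", "Monitor"},
    {"Laptop", "Keyboard", "Mouse"},
]
print(rule_metrics(q3, {"Keyboard"}, {"Mouse"}))   # (0.6, 0.75, 1.25)

For the four rules in Question 3, rule_metrics returns approximately (0.60, 0.75, 1.25), (0.40, 1.00, 1.25), (0.0, 0.0, 0.0) and (0.20, 0.50, 0.83), matching the values worked out above.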
Question 4: (20 points)

The following contingency table summarizes supermarket transaction data, where {burgers} refers to the transactions containing burgers, {no burgers} to the transactions containing no burgers, {tomatoes} to the transactions containing tomatoes, and {no tomatoes} to the transactions containing no tomatoes.

              Tomatoes   No Tomatoes   Row sum
Burgers       4000       200           4200
No Burgers    2000       1500          3500
Column sum    6000       1700          7700

Based on the given data, is the purchase of tomatoes independent of the purchase of burgers? If not, what kind of correlation relationship exists between the two?

Solution
Hypotheses
Null hypothesis (H0): there is no relationship between the purchase of tomatoes and the purchase of burgers (the two are independent).
Alternative hypothesis (HA): there is a relationship between the purchase of tomatoes and the purchase of burgers.
The test is carried out at the 5% significance level.

Observed values
              Tomatoes   No Tomatoes   Total
Burgers       4000       200           4200
No Burgers    2000       1500          3500
Total         6000       1700          7700

Expected values (row total x column total / grand total)
              Tomatoes   No Tomatoes   Total
Burgers       3272.73    927.27        4200
No Burgers    2727.27    772.73        3500
Total         6000       1700          7700

Chi-square statistic:
chi^2 = (4000 - 3272.73)^2 / 3272.73 + (200 - 927.27)^2 / 927.27
      + (2000 - 2727.27)^2 / 2727.27 + (1500 - 772.73)^2 / 772.73
      ≈ 161.62 + 570.41 + 193.94 + 684.49
      ≈ 1610.5

With 1 degree of freedom the 5% critical value is 3.841, so the p-value is far below the 5% significance level. We therefore reject the null hypothesis and conclude that the purchase of tomatoes is significantly associated with the purchase of burgers. Because the observed count for {burgers, tomatoes} (4000) is well above its expected count (3272.73), the two items are positively correlated: customers who buy burgers are more likely than chance to also buy tomatoes.

Question 5: (10 points)

Explain the difference between the Apriori and FP-Growth algorithms used for Market Basket Analysis.

Answer
There are several differences between the FP-Growth algorithm and the Apriori algorithm, as presented in the table below.

Apriori                                                  FP-Growth
Array (candidate-set) based algorithm                    Tree (FP-tree) based algorithm
Applies the join and prune technique to generate         Builds conditional pattern bases and conditional
candidate itemsets                                       FP-trees that satisfy the minimum support; no
                                                         candidate generation
Uses a breadth-first search                              Uses a depth-first search
Uses a level-wise approach                               Uses a pattern-growth approach
Generates a large number of candidates, so the runtime   Runtime grows roughly linearly with the size of
can grow exponentially with the number of items          the database

Question 6: (20 points)

Using the following transaction table:
Transaction Id   Itemsets
T1               I1, I2, I5
T2               I2, I4
T3               I2, I3
T4               I1, I2, I4
T5               I1, I2, I3, I5

Find the frequent pattern itemsets with minimum support 2 using the FP-Growth algorithm.

1) Create the FP-Tree. (3 points)
Solution
First scan: count the items, keep those with support count >= 2 (here, all of them), and order the items in each transaction by descending support count.

Item   Support count
I2     5
I1     3
I3     2
I4     2
I5     2

Transaction Id   Itemsets           Ordered frequent items
T1               I1, I2, I5         I2, I1, I5
T2               I2, I4             I2, I4
T3               I2, I3             I2, I3
T4               I1, I2, I4         I2, I1, I4
T5               I1, I2, I3, I5     I2, I1, I3, I5

Second scan: insert the ordered transactions into the FP-tree.

null
|-- I2:5
    |-- I1:3
    |   |-- I5:1
    |   |-- I4:1
    |   |-- I3:1
    |       |-- I5:1
    |-- I4:1
    |-- I3:1

2) Mine the FP-tree by creating the conditional pattern base for each item. (3 points)
Solution
Item   Conditional pattern base
I5     {I2, I1: 1}, {I2, I1, I3: 1}
I4     {I2, I1: 1}, {I2: 1}
I3     {I2, I1: 1}, {I2: 1}
I1     {I2: 3}

3) Create the conditional FP-trees. (4 points)
Solution
Item   Conditional FP-tree (items with count >= 2 in the pattern base)
I5     <I2:2, I1:2>
I4     <I2:2>
I3     <I2:2>
I1     <I2:3>

4) Generate the frequent patterns. (10 points)
Solution
Item   Frequent patterns generated
I5     {I2, I5: 2}, {I1, I5: 2}, {I2, I1, I5: 2}
I4     {I2, I4: 2}
I3     {I2, I3: 2}
I1     {I2, I1: 3}

Together with the frequent single items {I2: 5}, {I1: 3}, {I3: 2}, {I4: 2} and {I5: 2}, these are all the frequent patterns with minimum support 2.
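The chi-square statistic in Question 4 and the frequent patterns in Question 6 can likewise be checked with a short Python sketch. It is only an illustration: the chi-square part uses nothing but the observed counts from the contingency table, and the FP-Growth part assumes the third-party mlxtend library is installed (a hand-built FP-tree, as worked through above, gives the same result).

# Question 4: chi-square test of independence for the burgers/tomatoes table
observed = [[4000, 200],    # burgers:    tomatoes, no tomatoes
            [2000, 1500]]   # no burgers: tomatoes, no tomatoes
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)
chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2 += (obs - expected) ** 2 / expected
print(round(chi2, 2))   # about 1610.5, far above the 5% critical value of 3.841 for 1 degree of freedom

# Question 6: FP-Growth on the five transactions, assuming mlxtend is available
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

transactions = [["I1", "I2", "I5"],
                ["I2", "I4"],
                ["I2", "I3"],
                ["I1", "I2", "I4"],
                ["I1", "I2", "I3", "I5"]]
encoder = TransactionEncoder()
onehot = pd.DataFrame(encoder.fit(transactions).transform(transactions),
                      columns=encoder.columns_)
# Minimum support of 2 transactions out of 5 = 0.4
print(fpgrowth(onehot, min_support=0.4, use_colnames=True))

Running the sketch should print a chi-square value close to 1610.5 and the same frequent itemsets listed in part 4 of Question 6.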
