Comparison essay: Classification Methods for Data Mining.

Part 1 - Practical and Report
There are two steps to complete in this task:
Step 1: You are required to perform a data mining task to evaluate different classification algorithms. Load the vote.arff data set into Weka and compare the performance on this data set for three classification algorithms:
Decision Tree
Naive Bayes
k-Nearest Neighbour
Step 2: From step 1 outputs, write a report that shows the performance of the different algorithms and comment on their accuracy using the confusion matrix and other performance metrics used in Weka. In your report consider:
Is there a difference in performance between the algorithms?
Which algorithm performs best?
Your report should include the necessary screenshots, tables, graphs, etc. to make your report understandable to the reader.
All diagrams that are required should be inserted into the document in the appropriate position.
Your answers to the questions should be precise but complete and informative.
Each question should be answered individually with the corresponding label to indicate the tasks completed e.g. Task 1 a.

K-NN Classifier

In data classification, the primary aim is to assign observations to distinct categories with proper justification. The categories are determined by specific properties of the data set and by similarities among the variables. The classification error indicates the accuracy of the data mining technique. The classification model is built after pre-processing the data, during which the variables are checked for validity and consistency. The classifiers are then applied to the data set to assign each instance to a group labelled by the class variable. This work explores three classification techniques to find the best classification of voters as Republicans or Democrats. The predictive accuracy of the K-NN, Naïve Bayes, and Decision Tree classifiers was analyzed to guide future researchers in the field of data mining (Thornton, Hutter, Hoos, & Leyton-Brown, 2013).

The KNN classifier is an instance-based learning method that rests on one main assumption: samples that lie close to one another tend to belong to the same class. In the Weka environment KNN classifiers are grouped under lazy learners. The algorithm does not build an explicit model from the sample observations; it simply stores them and defers the work to prediction time. As a result, KNN needs almost no training time, although each prediction requires comparing the new observation with the stored samples. Classification relies on the similarity between sample observations: the K nearest neighbors of an observation are found, and the observation is assigned to the class to which most of those neighbors belong. The performance of the algorithm varies with the choice of K (Jiang, Cai, Wang, & Jiang, 2007).
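The neighbor search and majority vote described above can be illustrated with a minimal sketch. It assumes nominal yes/no attributes compared by Hamming distance (the number of differing answers); the records and the function names `hamming` and `knn_predict` are hypothetical and are not taken from Weka or from vote.arff.

```python
from collections import Counter

def hamming(a, b):
    """Count the attribute positions where two records disagree."""
    return sum(x != y for x, y in zip(a, b))

def knn_predict(train, query, k=1):
    """Return the majority class among the k stored records closest to query.

    train is a list of (attributes, label) pairs; nothing is learned in
    advance, which is why KNN is called a lazy learner.
    """
    nearest = sorted(train, key=lambda rec: hamming(rec[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy records: three yes/no votes per person (not real data).
train = [
    (("y", "y", "n"), "democrat"),
    (("y", "n", "n"), "democrat"),
    (("n", "n", "y"), "republican"),
    (("n", "y", "y"), "republican"),
    (("n", "n", "n"), "republican"),
]

print(knn_predict(train, ("y", "y", "y"), k=3))
```

Changing `k` changes how many stored records take part in the vote, which is the sensitivity to K noted in the paragraph above.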

The Naive Bayes classifier is built on Bayes' theorem of conditional probabilities together with an independence assumption. It assumes that the factors of the study are conditionally independent of one another given the class, and that every factor contributes to the classification. It is a probabilistic classifier, which calculates the likelihood of class membership. The class-conditional likelihood of each factor is treated as a separate term. The technique is useful when the number of inputs is high. The posterior probabilities are calculated from the conditional probabilities using Bayes' theorem.
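As a rough illustration of how the posterior is assembled from per-attribute likelihoods, the sketch below scores each class as the log prior plus the sum of log conditional likelihoods. The add-one (Laplace) smoothing, the toy records, and the names `train_nb` and `predict_nb` are assumptions made for this example and are not a description of Weka's internals.

```python
import math
from collections import Counter, defaultdict

def train_nb(records):
    """Count class frequencies and, per class and attribute, value frequencies."""
    class_counts = Counter(label for _, label in records)
    value_counts = defaultdict(Counter)  # (label, attribute index) -> value counts
    for attrs, label in records:
        for i, value in enumerate(attrs):
            value_counts[(label, i)][value] += 1
    return class_counts, value_counts

def predict_nb(model, attrs, values=("y", "n")):
    """Pick the class with the highest log posterior under the independence assumption."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        log_p = math.log(n / total)  # log prior
        for i, value in enumerate(attrs):
            # Add-one smoothing (an assumption here) avoids zero probabilities.
            count = value_counts[(label, i)][value]
            log_p += math.log((count + 1) / (n + len(values)))
        scores[label] = log_p
    return max(scores, key=scores.get)

# Hypothetical toy records: three yes/no votes per person (not real data).
train = [
    (("y", "y", "n"), "democrat"),
    (("y", "n", "n"), "democrat"),
    (("n", "n", "y"), "republican"),
    (("n", "y", "y"), "republican"),
    (("n", "n", "n"), "republican"),
]

model = train_nb(train)
print(predict_nb(model, ("y", "y", "n")))
```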

The J48 classifier, Weka's implementation of the C4.5 Decision Tree algorithm, constructs the classification tree from the observed values of the variables. The J48 algorithm induces the tree by searching the data with the help of breadth-first and depth-first strategies. The diagrammatic representation makes the model easier to interpret. The root node, internal nodes, and leaf nodes of a decision tree form a flow chart in which each internal node denotes a test condition on an attribute. Each branch represents an outcome of the test, each leaf node is assigned a class label, and the root node is the topmost node. The Decision Tree is constructed top-down, with the data recursively split until a stopping condition is reached (Bhargava, Sharma, Bhargava, & Mathuria, 2013).
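To make the top-down splitting step concrete, the sketch below scores candidate attributes by gain ratio, the split criterion used by C4.5, and picks the best one for the root. It covers only the choice of a single split, not pruning or recursion, and the toy records and function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(records, attr_index):
    """Information gain of splitting on one attribute, divided by the split information."""
    labels = [label for _, label in records]
    partitions = {}
    for attrs, label in records:
        partitions.setdefault(attrs[attr_index], []).append(label)
    n = len(records)
    remainder = sum(len(p) / n * entropy(p) for p in partitions.values())
    info_gain = entropy(labels) - remainder
    split_info = -sum(len(p) / n * math.log2(len(p) / n) for p in partitions.values())
    return info_gain / split_info if split_info > 0 else 0.0

# Hypothetical toy records: three yes/no votes per person (not real data).
train = [
    (("y", "y", "n"), "democrat"),
    (("y", "n", "n"), "democrat"),
    (("n", "n", "y"), "republican"),
    (("n", "y", "y"), "republican"),
    (("n", "n", "n"), "republican"),
]

# The attribute with the highest gain ratio becomes the test at the root node.
best_attribute = max(range(3), key=lambda i: gain_ratio(train, i))
print(best_attribute)
```

In this toy set the first attribute separates the two classes perfectly, so it is chosen for the root; a full tree builder would repeat the same choice on each resulting partition.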

Naive Bayes Classifier

Naïve Bayes is a probabilistic classifier and KNN is an instance-based one, whereas Decision Tree is a deterministic technique. Decision Tree and Naïve Bayes are suited to large data sets, while KNN is better applied to small data sets. The fastest of the three is Decision Tree, followed by Naïve Bayes; KNN is known to be the slowest. KNN classification does not cope well with outliers in the dataset (Wahbeh, Al-Radaideh, Al-Kabi, & Al-Shawakfa, 2011). Naive Bayes and J48 are effective in dealing with noisy datasets. All three classifiers are capable of classifying with high accuracy. However, Naïve Bayes requires a large data set to reach a higher level of accuracy (Ramzan, 2016).

Weka (Waikato Environment for Knowledge Analysis) data mining software was used, and the three classification tools were applied to analyze the vote.arff dataset. The sample size of the dataset was sufficiently large for the current comparative investigation (Hall et al., 2009).

The data file contained voting information for 435 people, grouped into two classes, Democrats and Republicans. Votes were recorded on sixteen attributes, which were used to classify each person as a Democrat or a Republican. Some answers were missing in every column of the dataset, implying that respondents avoided some of the questions. The ReplaceMissingValues filter was chosen to replace the missing values with a measure of central tendency (the mode for nominal attributes, the mean for numeric ones). The class analysis identified 267 Democrats and 168 Republicans.
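The replacement of missing answers by the mode can be sketched as follows. This is an illustrative re-implementation for nominal columns, not Weka's filter itself; it assumes that missing values are marked with `?` as in ARFF files, and the rows and the function name `replace_missing_with_mode` are hypothetical.

```python
from collections import Counter

def replace_missing_with_mode(rows, missing="?"):
    """Fill each missing nominal value with the most frequent value of its column."""
    modes = []
    for column in zip(*rows):
        observed = [v for v in column if v != missing]
        # If a column has no observed values at all, leave it untouched.
        modes.append(Counter(observed).most_common(1)[0][0] if observed else missing)
    return [
        tuple(mode if value == missing else value for value, mode in zip(row, modes))
        for row in rows
    ]

# Hypothetical rows with two yes/no attributes; "?" marks a missing answer.
rows = [("y", "n"), ("?", "n"), ("y", "?"), ("n", "y")]
filled = replace_missing_with_mode(rows)
print(filled)
```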

All three classifiers were applied to the voting responses with 10-fold cross-validation, a technique known to be effective for analyzing classifier performance. Correctly and incorrectly classified instances, along with the confusion matrices, were analyzed for the Naïve Bayes, KNN, and J48 algorithms. The cost analysis for each classification process on the voting choices was also examined (Amin & Habib, 2015; Arora, 2012).
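The fold construction behind 10-fold cross-validation can be sketched as below. The sketch assumes a stratified split (each fold keeps roughly the same class proportions) and deals instances out round-robin without shuffling, so it is a simplification rather than a reproduction of Weka's procedure. The class sizes are the 267 and 168 reported above; the function name `stratified_folds` is hypothetical.

```python
from collections import defaultdict

def stratified_folds(labels, k=10):
    """Split instance indices into k folds with similar class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    position = 0  # running counter so fold sizes stay balanced across classes
    for indices in by_class.values():
        for idx in indices:
            folds[position % k].append(idx)
            position += 1
    return folds

labels = ["democrat"] * 267 + ["republican"] * 168
folds = stratified_folds(labels, k=10)

for test_idx in folds:
    held_out = set(test_idx)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    # A classifier would be trained on train_idx and evaluated on test_idx here;
    # the ten per-fold results are then combined into one overall estimate.
    assert len(train_idx) + len(test_idx) == len(labels)

print([len(f) for f in folds])
```

Every instance is used for testing exactly once and for training nine times, which is why the reported counts of correctly and incorrectly classified instances add up to the full 435.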

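The confusion matrix terms are:

True Positive (TP): The actual and the predicted values are both positive.

False Positive (FP): The predicted value is positive, but the actual value is negative.

Precision: The proportion of instances predicted as a class that actually belong to it, TP / (TP + FP); a measure of exactness.

Recall: The proportion of actual instances of a class that are correctly predicted, TP / (TP + FN), where a False Negative (FN) is a positive instance predicted as negative; a measure of completeness.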

The Naïve Bayes algorithm correctly classified 392 instances (90.11%) and incorrectly classified 43 instances (9.89%). The confusion matrix showed that 238 Democrats and 154 Republicans were correctly predicted. In the detailed accuracy by class, the precision for Democrats (0.944) was higher than that for Republicans (0.842). The cost analysis window reported a classification accuracy of 38.62%, a figure equal to the Republican share of the data (168 of 435), and the counts shown there, 267 Democrats and 168 Republicans, match the class totals.
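The Naïve Bayes figures above can be cross-checked with a short calculation. The sketch uses only counts stated in the text (238 of 267 Democrats and 154 of 168 Republicans correctly predicted) and derives the off-diagonal cells of the confusion matrix by subtraction; the function name `class_metrics` is hypothetical.

```python
def class_metrics(tp, fp, fn):
    """Return (precision, recall) for one class from its confusion-matrix counts."""
    return tp / (tp + fp), tp / (tp + fn)

# Counts stated in the report for Naive Bayes.
dem_correct, dem_total = 238, 267
rep_correct, rep_total = 154, 168

# Misclassified members of one class are the false positives of the other.
dem_missed = dem_total - dem_correct  # Democrats predicted as Republicans
rep_missed = rep_total - rep_correct  # Republicans predicted as Democrats

dem_precision, dem_recall = class_metrics(dem_correct, fp=rep_missed, fn=dem_missed)
rep_precision, rep_recall = class_metrics(rep_correct, fp=dem_missed, fn=rep_missed)
accuracy = (dem_correct + rep_correct) / (dem_total + rep_total)

print(f"accuracy = {accuracy:.4f}")
print(f"Democrat   precision = {dem_precision:.3f}, recall = {dem_recall:.3f}")
print(f"Republican precision = {rep_precision:.3f}, recall = {rep_recall:.3f}")
```

The computed precisions round to 0.944 and 0.842 and the accuracy to 90.11%, which agrees with the values read from the Weka output.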

Decision Tree Classifier

The KNN algorithm was first applied with K = 1 (the default). It correctly classified 407 instances (93.56%) and incorrectly classified 28 (6.44%). In the confusion matrix, 250 Democrats and 157 Republicans were correctly classified. In the detailed accuracy by class, the precision for Democrats (0.95) was slightly higher than that for Republicans (0.94). Later, K = 5 was adopted as the optimum value for the KNN classifier after checking other values. With K = 5 the algorithm correctly classified 409 instances (94.02%) and incorrectly classified 26 (5.98%). The cost analysis window again reported a classification accuracy of 38.62%, with 267 Democrats and 168 Republicans, figures that match the class totals.

The Decision Tree was the last classifier applied to the voting preferences. The J48 classifier correctly classified 419 instances (96.32%) and incorrectly classified 16 (3.68%). The confusion matrix showed 259 correctly predicted Democrats and 160 correctly predicted Republicans. The detailed accuracy by class showed a precision of 0.97 for Democrats and 0.95 for Republicans. The cost analysis of J48 listed 267 Democrats and 168 Republicans, again matching the class totals (Drazin & Montag, 2012).

In the present study, correctly classified instances were 94.02% for the KNN (k = 5) classifier, 96.32% for the J48 classifier, and 90.11% for the Naïve Bayes classifier. Incorrectly classified instances were 5.98% for KNN (k = 5), 3.68% for J48, and 9.89% for Naïve Bayes. The relative error was 46.88% for KNN (k = 5), 38.08% for J48, and 61.91% for Naïve Bayes. From the confusion matrices, True Positives were 251 for KNN (k = 5), 259 for J48, and 238 for Naïve Bayes; True Negatives were 158 for KNN (k = 5), 160 for J48, and 154 for Naïve Bayes. Hence, in terms of accuracy and performance, the best classifier was J48 (Decision Tree) (Patil & Sherekar, 2013).

Conclusion

All three classifiers classified the instances with high accuracy. The results implied that all the algorithms performed well with low error rates. The cost analysis of the classifiers indicated equal prediction capability, which was evident from the confusion matrices in the cost analysis of the classification tools. However, based on prediction capability, the Decision Tree (J48) was identified as the most effective classifier (Kaur & Chhabra, 2014). The tree diagram was a helpful tool for analyzing the output of the classifier. Comparative results from Weka for identifying Democrats and Republicans showed that J48 performed better than Naïve Bayes and KNN. The accuracy of KNN was higher than that of Naïve Bayes. Although Naïve Bayes performed the worst here, the classifier can achieve higher accuracy on particular composite data sets. Because the dataset is small, the time taken to build each of the three classification models was reported as zero seconds. Future exploration of the three classifiers on a more complex data set with more constraints could yield different results (Salama, Abdelhalim, & Zeid, 2012).

References

Amin, M. N., & Habib, M. A. (2015). Comparison of different classification techniques using WEKA for hematological data. American Journal of Engineering Research, 4(3), 55-61.

Arora, R. (2012). Comparative analysis of classification algorithms on different datasets using WEKA. International Journal of Computer Applications, 54(13).

Bhargava, N., Sharma, G., Bhargava, R., & Mathuria, M. (2013). Decision tree analysis on J48 algorithm for data mining. Proceedings of International Journal of Advanced Research in Computer Science and Software Engineering, 3(6).

Drazin, S., & Montag, M. (2012). Decision tree analysis using WEKA. Machine Learning-Project II, University of Miami, 1-3.

Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA data mining software: an update. ACM SIGKDD explorations newsletter, 11(1), 10-18.

Jiang, L., Cai, Z., Wang, D., & Jiang, S. (2007, August). Survey of improving k-nearest-neighbor for classification. In Fuzzy Systems and Knowledge Discovery, 2007. FSKD 2007. Fourth International Conference on (Vol. 1, pp. 679-683). IEEE.

Kaur, G., & Chhabra, A. (2014). Improved J48 classification algorithm for the prediction of diabetes. International Journal of Computer Applications, 98(22).

Patil, T. R., & Sherekar, S. S. (2013). Performance analysis of Naive Bayes and J48 classification algorithm for data classification. International journal of computer science and applications, 6(2), 256-261.

Ramzan, M. (2016, August). Comparing and evaluating the performance of WEKA classifiers on critical diseases. In Information Processing (IICIP), 2016 1st India International Conference on (pp. 1-4). IEEE.

Salama, G. I., Abdelhalim, M. B., & Zeid, M. A. E. (2012, November). Experimental comparison of classifiers for breast cancer diagnosis. In Computer Engineering & Systems (ICCES), 2012 Seventh International Conference on (pp. 180-185). IEEE.

Thornton, C., Hutter, F., Hoos, H. H., & Leyton-Brown, K. (2013, August). Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 847-855). ACM.

Wahbeh, A. H., Al-Radaideh, Q. A., Al-Kabi, M. N., & Al-Shawakfa, E. M. (2011). A comparison study between data mining tools over some classification methods. International Journal of Advanced Computer Science and Applications, 8(2), 18-26.
