Preprocessing of the Dataset

The question I investigated was: “Does data discretization improve the performance of the model?” Discretization is the placing of values into buckets, or bins, to create a limited number of discrete states.

The question is interesting because I hypothesise that when data is grouped into buckets or bins, algorithms can identify patterns and trends within the data more easily than when the data is not grouped, so it will be interesting to see whether discretization actually improves model performance. Data modelling tools such as WEKA treat buckets as ordered, discrete values; each bucket is treated as a single value. One copy of the dataset was saved without being discretized while another copy was saved after being discretized. I then applied three classifiers to each copy to compare performance: J48 (a decision tree), Naïve Bayes, and a support vector machine.

The first step was to load the text dataset into WEKA. To do this, I opened WEKA, clicked on Explorer, and selected the practical dataset provided, choosing the “TextDirectoryLoader” to load the data as text. At this point the text itself was not viewable in WEKA, but the classes were. Entertainment was the minority class with 386 text files and sport was the majority class with 511. The class attribute had 2 distinct values (with counts 386 and 511) and 0 unique values. Even though the contents of the text attribute were not visible at this point, the text attribute had 874 distinct values and 815 unique values. To address the class imbalance, I applied undersampling to reduce the majority class (sport) to the same number of text files as the minority class (entertainment). To do this, I applied the “SpreadSubsample” filter from the supervised-instance category and set maxCount to 386. This eliminated the problem of a word having the same frequency in both classes but less weight or importance in the class with more texts. In the next step I assigned the class attribute using the “ClassAssigner” filter from the unsupervised-attribute category; this was simply a safety measure to tell WEKA which attribute is the class. These two preprocessing steps, in particular the undersampling, had the following effects: the instances were reduced from 897 to 772, the distinct values in the text attribute from 874 to 752, and the unique values in the text attribute from 815 to 732. The sport class was assigned the label 2 and the entertainment class the label 1.
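
The same preprocessing can be scripted with WEKA's Java API. The following is a minimal sketch, assuming the practical dataset sits in a local directory (the path bbc-texts is hypothetical) with one subdirectory per class, which is the layout TextDirectoryLoader expects:

```java
import java.io.File;

import weka.core.Instances;
import weka.core.converters.TextDirectoryLoader;
import weka.filters.Filter;
import weka.filters.supervised.instance.SpreadSubsample;

public class LoadAndBalance {
    public static void main(String[] args) throws Exception {
        // Load raw text files; each subdirectory name becomes a class label.
        TextDirectoryLoader loader = new TextDirectoryLoader();
        loader.setDirectory(new File("bbc-texts")); // hypothetical path
        Instances data = loader.getDataSet();

        // SpreadSubsample is a supervised filter, so the class must be set first;
        // TextDirectoryLoader puts the class attribute last.
        data.setClassIndex(data.numAttributes() - 1);

        // Undersample the majority class (sport) down to 386 instances,
        // matching the minority class (entertainment).
        SpreadSubsample spread = new SpreadSubsample();
        spread.setMaxCount(386);
        spread.setInputFormat(data);
        Instances balanced = Filter.useFilter(data, spread);

        System.out.println("Instances after undersampling: " + balanced.numInstances());
    }
}
```

Setting the class index programmatically plays the same role as the ClassAssigner step above.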

The next step was to convert the strings in the text attribute into word vectors. To do this I clicked on Choose, then unsupervised, then attribute, and selected the “StringToWordVector” filter. In the filter settings I changed outputWordCounts from false to true and left the other parameters at their defaults; setting outputWordCounts to true makes word frequencies get reported as counts rather than as binary presence/absence. The total number of attributes after applying the StringToWordVector filter was 1528. From the output, every attribute can be visualized across the two classes, sport and entertainment. For example, the attribute “twice” had values occurring between 1 and 2. The attribute “minutes” had 9 distinct values, occurring between 0 and 8, with a mean of 0.172 and a standard deviation of 0.717. The attribute “Zealand” had 6 distinct values, occurring between 0 and 32. The attribute “would” had 13 distinct values, occurring between 0 and 16, with 6 unique values. The attribute “will” had 13 distinct values, occurring between 0 and 14, with 1 unique value. The attribute “will” had 19 distinct values, occurring between 0 and 21, with 2 unique values. The attribute “up” had 10 distinct values, occurring between 0 and 13, with 4 unique values.
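
In the Java API this corresponds to the StringToWordVector filter with outputWordCounts enabled; a sketch, assuming balanced is the undersampled dataset from the previous step:

```java
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class Vectorize {
    /** Turn the string text attribute into one numeric attribute per word. */
    public static Instances toWordVectors(Instances balanced) throws Exception {
        StringToWordVector s2wv = new StringToWordVector();
        s2wv.setOutputWordCounts(true); // report counts, not binary presence/absence
        s2wv.setInputFormat(balanced);
        return Filter.useFilter(balanced, s2wv);
    }
}
```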

Attribute Ranking

Next, the attributes were ranked to find those that pass the test, using the “AttributeSelection” filter from the supervised-attribute category, with “InfoGainAttributeEval” as the evaluator and “Ranker” as the search method, with the threshold set to 0. The result was that only 951 attributes out of 1528 passed the test. The attribute “film” was ranked first: in the visualize section, “film” occurred zero times in a bar in which sport occupies more than 60%, and indeed one would hardly expect the word film in the sport class. The word “film” was probably ranked first because it is present in most instances. The word “players” was ranked 6th; it occurred 0 times in 648 instances, once in 66 instances, twice in 36 instances, and 8 times in one instance. The word “battle” was ranked 951st; it appeared 0 times in 735 instances, once in 36 instances, and twice in 4 instances. It was ranked last because it is absent from most instances.
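
Scripted, the same ranking uses the AttributeSelection filter with InfoGainAttributeEval and a Ranker search; a minimal sketch, assuming vectors is the word-vector dataset from the previous step:

```java
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;

public class RankAttributes {
    /** Rank attributes by information gain and keep only those that pass. */
    public static Instances rankByInfoGain(Instances vectors) throws Exception {
        AttributeSelection select = new AttributeSelection();
        Ranker ranker = new Ranker();
        ranker.setThreshold(0.0); // threshold 0 discards attributes with no information gain
        select.setEvaluator(new InfoGainAttributeEval());
        select.setSearch(ranker);
        select.setInputFormat(vectors);
        return Filter.useFilter(vectors, select);
    }
}
```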

Next, the dataset was converted and saved as two separate datasets to explore the question: does data discretization improve the performance of the model? The first dataset was saved without applying discretization, as an ARFF file named “undiscretized”. For the second dataset, the Discretize filter with 40 bins was applied, and the result was saved as an ARFF file named “discretized”.
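
A sketch of this step against the Java API, assuming the attribute-selected dataset from above; the file names undiscretized.arff and discretized.arff are hypothetical:

```java
import java.io.File;

import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;

public class SaveBothDatasets {
    public static void saveBoth(Instances ranked) throws Exception {
        // First dataset: saved as-is, without discretization.
        ArffSaver rawSaver = new ArffSaver();
        rawSaver.setInstances(ranked);
        rawSaver.setFile(new File("undiscretized.arff")); // hypothetical file name
        rawSaver.writeBatch();

        // Second dataset: equal-width discretization into 40 bins.
        Discretize disc = new Discretize();
        disc.setBins(40);
        disc.setInputFormat(ranked);
        Instances discretized = Filter.useFilter(ranked, disc);

        ArffSaver discSaver = new ArffSaver();
        discSaver.setInstances(discretized);
        discSaver.setFile(new File("discretized.arff")); // hypothetical file name
        discSaver.writeBatch();
    }
}
```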

Here, each dataset created was split into two: a training set consisting of 80% of the data and a test set taking the other 20%. Visualization of the training set showed that most attributes were poorly correlated with one another. The training set was used for modeling to arrive at the best model to apply to the test set. The undiscretized data was reloaded into WEKA, and the class attribute was reassigned by opening the Edit window, right-clicking on the class column, and setting it as the class. I then ran J48 with an 80% split as the training set. The results were as follows: 151 instances were correctly classified and three instances were incorrectly classified, giving an accuracy of 98.0519%. The model performed very well overall. For entertainment: the true positive rate was 98.7%, the false positive rate was 2.7%, the precision was 97.5%, the recall was 98.7%, and the ROC area was 99.2%. For sport: the true positive rate was 97.3%, the false positive rate was 1.3%, the precision was 98.5%, the recall was 97.3%, and the ROC area was 99.2%. The confusion matrix was correspondingly good: 78 out of 79 entertainment instances were classified as entertainment, and 73 out of 75 instances were classified as sport.
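
The percentage-split evaluation can be reproduced in code; this is a sketch, and the random seed is an assumption (the Explorer shuffles before splitting, with a default seed of 1):

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class SplitEvaluate {
    public static void evaluateJ48(Instances data) throws Exception {
        // Reproduce an 80% train / 20% test percentage split.
        Instances shuffled = new Instances(data);
        shuffled.randomize(new Random(1)); // seed 1 is an assumption
        int trainSize = (int) Math.round(shuffled.numInstances() * 0.8);
        Instances train = new Instances(shuffled, 0, trainSize);
        Instances test  = new Instances(shuffled, trainSize,
                                        shuffled.numInstances() - trainSize);

        J48 tree = new J48();
        tree.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(tree, test);
        System.out.println(eval.toSummaryString());      // accuracy and error counts
        System.out.println(eval.toClassDetailsString()); // TP/FP rate, precision, recall, ROC
        System.out.println(eval.toMatrixString());       // confusion matrix
    }
}
```

Swapping new J48() for new NaiveBayes() (from weka.classifiers.bayes) reproduces the Naïve Bayes run described next.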

The next classifier was Naïve Bayes, again run with an 80% percentage split for the training set. The results were also quite good: 150 instances were correctly classified and only 4 instances were incorrectly classified, giving an accuracy of 97.40%. Overall, the model performed superbly. For entertainment: the true positive rate was 97.5%, the false positive rate was 2.7%, the precision was 97.5%, the recall was 97.5%, and the ROC area was 98.6%. For sport: the true positive rate was 97.3%, the false positive rate was 2.5%, the precision was 97.3%, the recall was 97.3%, and the ROC area was 97.9%. The confusion matrix reflected the same picture: 77 instances were correctly classified as entertainment with only 2 entertainment instances misclassified, and 73 instances were correctly classified as sport with only 2 sport instances misclassified.

Creating Two Datasets

For the support vector machine, I first used grid search across the four kernels: linear, polynomial, radial basis function, and sigmoid. The process was as follows: I clicked on classifiers, selected meta, then clicked on GridSearch. In the GridSearch properties I set XMax to 5, XMin to -2, YMax to 5, and YMin to -2, changed both YExpression and XExpression to “pow(BASE,I)”, and set the X property to classifier.cost and the Y property to classifier.gamma. For the classifier I selected LibSVM, and in the LibSVM properties I changed the kernel type each time I ran the grid search. Back in the grid search parameters I set the filter to “AllFilter”, then clicked OK and ran the grid search. After four runs, the best kernel was linear, with 100,000 for the cost and 100,000 for the gamma. The results were as follows: 153 instances were correctly classified, an accuracy of 99.35%. For entertainment: the true positive rate and recall were 100%, the precision was 98.8%, the false positive rate was 1.3%, and the ROC area was 99.3%. For sport: the true positive rate was 98.7%, the false positive rate was 0%, the precision was 100%, the recall was 98.7%, and the ROC area was 99.3%. The confusion matrix reflected the same, with all 79 entertainment instances correctly classified and 74 out of 75 sport instances correctly classified.
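
A sketch of the tuning setup via the API; GridSearch and LibSVM ship as separate WEKA packages, and the property paths below simply mirror the GUI settings described above (the base of the pow expression defaults to 10):

```java
import weka.classifiers.functions.LibSVM;
import weka.classifiers.meta.GridSearch;
import weka.core.Instances;
import weka.core.SelectedTag;

public class TuneSvm {
    public static GridSearch tunedSvm(Instances train) throws Exception {
        LibSVM svm = new LibSVM();
        svm.setKernelType(new SelectedTag(LibSVM.KERNELTYPE_LINEAR,
                                          LibSVM.TAGS_KERNELTYPE));

        GridSearch grid = new GridSearch();
        grid.setClassifier(svm);
        // Search cost on the X axis and gamma on the Y axis, each over
        // pow(10, -2) .. pow(10, 5), matching the GUI settings above.
        grid.setXProperty("classifier.cost");
        grid.setYProperty("classifier.gamma");
        grid.setXExpression("pow(BASE,I)");
        grid.setYExpression("pow(BASE,I)");
        grid.setXMin(-2);
        grid.setXMax(5);
        grid.setYMin(-2);
        grid.setYMax(5);

        grid.buildClassifier(train); // runs the search and keeps the best pair
        return grid;
    }
}
```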

Lastly, the discretized dataset was loaded into WEKA. The same three classifiers were run, but this time the filter under “Choose” was not None; it was Discretize from the unsupervised-attribute category. The kernel used was still linear, as it had proved the best, with the same kernel parameters: 100,000 for the cost and 100,000 for the gamma. The training split was 80% for all the classifiers. I then ran J48 with the 80% split as the training set. The results were as follows: 136 instances were correctly classified and 18 instances were incorrectly classified, an accuracy of 88.3%. The model did not perform well overall. For entertainment: the true positive rate was 96.2%, the false positive rate was 20%, the precision was 83.5%, the recall was 96.2%, and the ROC area was 94.7%. For sport: the true positive rate was 80%, the false positive rate was 3.8%, the precision was 95.2%, the recall was 80%, and the ROC area was 94.7%. Consequently, the confusion matrix was not all good: 20% of the sport instances (15 out of 75) were incorrectly classified. The entertainment classification, however, did well, with 76 out of 79 instances correctly classified.
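
Reloading a saved ARFF file for these runs is a one-liner with DataSource; a minimal sketch, assuming the discretized.arff file name used earlier:

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadDiscretized {
    public static Instances load() throws Exception {
        // ARFF files do not record which attribute is the class,
        // so it has to be set again after loading.
        Instances data = DataSource.read("discretized.arff"); // hypothetical file name
        data.setClassIndex(data.numAttributes() - 1); // adjust if the class is not last
        return data;
    }
}
```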

The next classifier was Naïve Bayes, again run with an 80% percentage split for the training set. The results were very good: 152 instances were correctly classified and only 2 instances were incorrectly classified, giving an accuracy of 98.7%. Overall, the model performed superbly. For entertainment: the true positive rate was 98.7%, the false positive rate was 1.3%, the precision was 98.7%, the recall was 98.7%, and the ROC area was 99.9%. For sport: the true positive rate was 98.7%, the false positive rate was 1.3%, the precision was 98.7%, the recall was 98.7%, and the ROC area was 99.9%. The confusion matrix reflected the same picture: 78 instances were correctly classified as entertainment with only 1 misclassified, and 74 instances were correctly classified as sport with only 1 misclassified.

Performance of Classifiers

For the support vector machine, the same settings were applied to enable comparison: the linear kernel with 100,000 for the cost and 100,000 for the gamma. The results were as follows: 83 instances were correctly classified, an accuracy of 53.9%. For entertainment: the true positive rate and recall were 10.1%, the false positive rate was 0%, the ROC area was 55.1%, and the precision was 100%. For sport: the true positive rate and recall were 100%, the false positive rate was 89.9%, the precision was 51.4%, and the ROC area was 55.1%. The confusion matrix told the same story: 71 of the 79 entertainment instances were misclassified as sport, while all 75 sport instances were correctly classified. Broadly, this model did not perform well.

Conclusion

To summarize, the question was: “Does data discretization improve the performance of the model?” To answer it, the data was loaded, converted, and saved as two different ARFF files, with one dataset discretized and the other not.

On the undiscretized dataset, the performance of J48 was very good: most of the instances were correctly classified, giving an accuracy of 98.05%. On average the true positive rate, precision, and recall were 98.1%, the false positive rate was 2%, and the ROC area was 99.2%. The confusion matrix was also good. The performance of Naïve Bayes was also good: 97.4% of instances were classified correctly, and on average the true positive rate, precision, and recall were all 97.4%, with a ROC area of 98.2%. The confusion matrix was also good. For the support vector machine I used LibSVM, with the best kernel being the linear kernel and kernel parameters of 100,000 for the cost and 100,000 for the gamma. The accuracy was 99.35%; on average the true positive rate, precision, and recall were 99.4%, the false positive rate was 0.7%, and the ROC area was 99.3%. The confusion matrix was also good.

The second dataset was discretized, and the same classifiers were run. J48 classified 136 instances correctly, an accuracy of 88.3%, so the model did not perform well: on average the true positive rate was 88.1%, the false positive rate was 11.9%, the precision was 89.3%, the recall was 88.1%, and the ROC area was 94.7%. The confusion matrix exposed the model as poor, with 20% of the sport instances incorrectly classified. Naïve Bayes, by contrast, was quite good: 152 instances were correctly classified, an accuracy of 98.7%; on average the true positive rate, precision, and recall were all 98.7%, the false positive rate was 1.3%, and the ROC area was 99.9%. The support vector machine was poor: 83 instances were correctly classified, an accuracy of 53.9%; on average the true positive rate was 53.9%, the false positive rate was 43.8%, the precision was 76.3%, the recall was 53.9%, and the ROC area was 55.1%. In the confusion matrix all 75 sport instances were correctly classified, but 71 of the 79 entertainment instances were misclassified. Overall, this model performed poorly.

From these results, the discretized data performed poorly with J48 and the support vector machine, while Naïve Bayes improved only slightly (from 97.4% to 98.7%). This means that, with the same model settings, discretization had a negative effect on model performance, especially for the decision tree and the support vector machine: those models do not perform well when the data is discretized, whereas all three models perform very well when the data is not discretized.
