Data Mining: Steps - Techniques and Preprocessing

Data collection

Feeding data mining: in this step, the data is used to feed the model. Post-processing: in this step, the model is interpreted and evaluated.

Data collection is one of the key steps in the data mining process; poor-quality data collection can undermine every later stage. There are several data collection methods, but because this task involved big data, the data was collected online using a web data collection tool (Moser and Korstjens, 2018). The procedure used to collect the data was as follows:

The data mining topic was first defined
The problem was identified
The sources of data were established
The analysis indices were identified
The actual data collection was done

After the problem was identified, the next step was to select the data to be used for the task. Choosing indices that are relevant to the data mining topic is essential at this point. Given the task at hand, a dataset containing age, gender and weight was chosen. A suitable data source was selected and a set of related analysis indices was then identified. The source website provided the file for direct download, which made the data easier to obtain. The raw data was then collected from the selected website and put through a data cleaning process. Data cleaning involved identifying and correcting dirty records in the dataset, and verifying the data so that incomplete or irrelevant entries were modified, replaced or, in some instances, removed entirely.

Data pre-processing transforms raw data into meaningful and efficient datasets (Roiger, 2017). It involves several steps, which are described as follows:

Raw data often has missing parts or contains irrelevant content. In such cases, data cleaning has to be performed to handle the missing or noisy data (Sammut and Webb, 2017).

Missing data arises when certain values are absent from the dataset. It can be dealt with using the following techniques:

Ignoring the affected tuples is appropriate only when the available dataset is large and several values are missing within a tuple.

The missing values can also be filled in, and there are several ways to do so: manually, with the attribute mean, or with the most probable value.
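The two options for missing data can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the sample records, the use of None to mark a missing value, and the helper names drop_incomplete and fill_with_mean are assumptions made for the example and are not part of the original task.

```python
from statistics import mean

# Hypothetical records (gender, weight, age); None marks a missing value.
records = [
    (1, 54, 34),
    (0, None, 45),
    (0, 57, 42),
    (1, 63, None),
]

def drop_incomplete(rows):
    """Ignore every tuple that has at least one missing value."""
    return [row for row in rows if None not in row]

def fill_with_mean(rows, column):
    """Replace missing values in one column with the mean of the known values."""
    known = [row[column] for row in rows if row[column] is not None]
    column_mean = mean(known)
    return [
        tuple(column_mean if i == column and value is None else value
              for i, value in enumerate(row))
        for row in rows
    ]

# Option 1: keep only the complete tuples.
complete_only = drop_incomplete(records)

# Option 2: keep every tuple and fill the gaps in weight (column 1) and age (column 2).
filled = fill_with_mean(fill_with_mean(records, 1), 2)
```

Dropping tuples shrinks the dataset, which is why it suits large datasets, while mean filling keeps every record at the cost of introducing estimated values.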

Noisy data is data that carries no meaning and therefore cannot be interpreted by a machine (Eldén, 2019). It is usually generated by faulty data collection or by errors during data entry, and can be resolved as follows:

In the binning method, the data is sorted and divided into segments (bins) of equal size. Each segment is then handled separately, for example by replacing all values in the segment with its mean or with its boundary values.
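Smoothing by segment means and by segment boundaries can be sketched as follows. This is a minimal illustration, assuming the values are sorted first and split into equal-size segments; the sample values, the segment size of 3, and the function names are illustrative and do not come from the task's dataset.

```python
from statistics import mean

def make_bins(values, size):
    """Sort the values and split them into consecutive segments of `size` items."""
    ordered = sorted(values)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]

def smooth_by_means(bins):
    """Replace every value in a segment with that segment's mean."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the nearer of its segment's minimum or maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # each segment is already sorted
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

values = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # illustrative values only
bins = make_bins(values, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```

In this sketch a value that is equally close to both boundaries is assigned to the lower one; that tie-breaking rule is a design choice of the example.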

Data transformation converts the data into forms that are suitable for the data mining process.

Gender coding: 1 = Male, 0 = Female

Gender  Weight  Age (data mined)
1       20      7
0       65      23
1       54      34
0       46      45
0       57      42
0       57      41
0       56      34
1       54      31
0       54      32
1       53      23
0       67      70
1       90      45
0       89      45
1       67      56
1       63      56
0       61      76
0       56      45
0       76      32
1       93      34
1       66      44
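The gender coding used in the table (1 for male, 0 for female) is a simple example of transforming data into a form suitable for mining. The following is a minimal sketch, assuming the raw records carried the text labels "Male" and "Female"; the sample rows reuse the first three rows of the table, and the names GENDER_CODES and encode_gender are illustrative.

```python
# Assumed mapping from text label to numeric code, matching the table's legend.
GENDER_CODES = {"Male": 1, "Female": 0}

def encode_gender(rows):
    """Replace the text label in the first column with its numeric code."""
    return [(GENDER_CODES[gender], weight, age) for gender, weight, age in rows]

# First three rows of the table, written with text labels (an assumption
# about how the raw data looked before transformation).
raw_rows = [("Male", 20, 7), ("Female", 65, 23), ("Male", 54, 34)]
encoded = encode_gender(raw_rows)
```

A label outside the mapping raises a KeyError, which makes unexpected values visible during cleaning instead of silently passing them through.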

List of References

Eldén, L. (2019). Matrix methods in data mining and pattern recognition (Vol. 15). SIAM.

Moser, A., & Korstjens, I. (2018). Series: Practical guidance to qualitative research. Part 3: Sampling, data collection and analysis. European Journal of General Practice, 24(1), 9-18.

Roiger, R. J. (2017). Data mining: a tutorial-based primer. Chapman and Hall/CRC.

Sammut, C., & Webb, G. I. (2017). Encyclopedia of machine learning and data mining. Springer Publishing Company, Incorporated.
