Initial Data Understanding Report
In your Data Understanding section, be clear about the data you were given to work with, what you have had to do with it, its quality, and how useful it will be. A good structure for this section is to follow the tasks below, as a group, and to produce the output required for each.

Collect initial data
Task: Collect initial data
Acquire the data (or access to the data) listed in the project resources.
This initial collection includes data loading, if necessary for data understanding.
For example, if you use a specific tool for data understanding, it makes perfect sense to load your data into this tool. This effort possibly leads to initial data preparation steps. Note: if you acquire multiple data sources, integration is an additional issue, either here or in the later data preparation phase.

Output: Initial data collection report
List the dataset(s) acquired, together with their locations, the methods used to acquire them, and any problems encountered.
Record problems encountered and any resolutions achieved. This will aid with future replication of this project or with the execution of similar future projects.
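
One way to keep the collection report reproducible is to record each dataset's location, acquisition method, problems, and resolutions in a small structure at load time. The sketch below is illustrative only: the `CollectionRecord` class, the `data/customers.csv` path, and the sample rows are assumptions made for the example, not part of the project resources.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class CollectionRecord:
    """One entry in an initial data collection report (hypothetical structure)."""
    name: str
    location: str
    method: str
    problems: list = field(default_factory=list)
    resolutions: list = field(default_factory=list)


def load_csv(text):
    """Parse CSV text into a list of dict rows, one per record."""
    return list(csv.DictReader(io.StringIO(text)))


# Made-up sample standing in for the contents of a real file.
SAMPLE = "id,age,churned\n1,34,no\n2,,yes\n3,51,no\n"

rows = load_csv(SAMPLE)
record = CollectionRecord(
    name="customers",
    location="data/customers.csv",  # hypothetical path
    method="CSV export, parsed with csv.DictReader",
)

# Log any problem noticed during loading, with the resolution chosen.
if any(value == "" for row in rows for value in row.values()):
    record.problems.append("Empty fields found on load")
    record.resolutions.append("Kept as-is; revisit when verifying data quality")

print(record)
```

With several data sources, one record per source gives you the list of datasets, locations, methods, and problems that the report asks for.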

Describe data
Task: Describe data
Examine the "gross" or "surface" properties of the acquired data and report on the results.
Output: Data description report
Describe the data that has been acquired, including the format of the data, the quantity of data (for example, the number of records and fields in each table), the identities of the fields, and any other surface features which have been discovered. Evaluate whether the data acquired satisfies the relevant requirements.
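
The surface properties listed above (format, number of records, number of fields, field identities) can often be gathered with a few lines of code. The following is a minimal sketch using only the Python standard library. The sample data, the `describe` and `is_number` helpers, and the rough "numeric" versus "text" typing rule are assumptions for illustration.

```python
import csv
import io

# Made-up sample standing in for one acquired table.
SAMPLE = "id,age,churned\n1,34,no\n2,,yes\n3,51,no\n"


def is_number(value):
    """Return True if the string can be parsed as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def describe(text):
    """Summarise format, record count, field count, and rough field types."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    fields = reader.fieldnames or []
    field_types = {}
    for name in fields:
        # Ignore empty strings so missing values do not change the type guess.
        values = [row[name] for row in rows if row[name] != ""]
        all_numeric = bool(values) and all(is_number(v) for v in values)
        field_types[name] = "numeric" if all_numeric else "text"
    return {
        "format": "CSV",
        "records": len(rows),
        "fields": len(fields),
        "field_types": field_types,
    }


description = describe(SAMPLE)
print(description)
```

Running this once per table gives the raw numbers for the report; whether the data satisfies the requirements is still a judgement you need to make and write up.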

Explore data
Task: Explore data
This task addresses data mining questions using querying, visualization, and reporting techniques. These include distribution of key attributes (for example, the target attribute of a prediction task), relationships between pairs or small numbers of attributes, results of simple aggregations, properties of significant sub-populations, and simple statistical analyses. These analyses may directly address the data mining goals; they may also contribute to or refine the data description and quality reports, and feed into the transformation and other data preparation steps needed for further analysis.

Output: Data exploration report
Describe results of this task, including first findings or initial hypotheses and their impact on the remainder of the project. If appropriate, include graphs and plots to indicate data characteristics that suggest further examination of interesting data subsets.
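
Three of the exploration techniques named above are the distribution of a target attribute, a simple aggregation, and the relationship between a pair of attributes. A minimal sketch of each is shown here. The records, the field names (`age`, `plan`, `churned`), and the helper functions are invented for illustration, and a real project may prefer a dedicated analysis or plotting tool.

```python
from collections import Counter, defaultdict
from statistics import mean

# Made-up records; in practice these come from the collected data.
ROWS = [
    {"age": 34, "plan": "basic", "churned": "no"},
    {"age": 29, "plan": "basic", "churned": "yes"},
    {"age": 51, "plan": "premium", "churned": "no"},
    {"age": 45, "plan": "premium", "churned": "no"},
]


def target_distribution(rows, target):
    """Count how often each value of the target attribute occurs."""
    return Counter(row[target] for row in rows)


def mean_by_group(rows, group_field, value_field):
    """Simple aggregation: mean of value_field within each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_field]].append(row[value_field])
    return {key: mean(values) for key, values in groups.items()}


def cross_tab(rows, field_a, field_b):
    """Relationship between two attributes as counts of value pairs."""
    return Counter((row[field_a], row[field_b]) for row in rows)


distribution = target_distribution(ROWS, "churned")
age_by_plan = mean_by_group(ROWS, "plan", "age")
plan_vs_churn = cross_tab(ROWS, "plan", "churned")

print(distribution)
print(age_by_plan)
print(plan_vs_churn)
```

Counts and group means like these are a starting point for the first findings and hypotheses in the report, not a substitute for interpreting them.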
Verify data quality
Task: Verify data quality
Examine the quality of the data, addressing questions such as: Is the data complete (does it cover all the cases required)? Is it correct, or does it contain errors and, if there are errors, how common are they? Are there missing values in the data? If so, how are they represented, where do they occur, and how common are they?
Output: Data quality report
List the results of the data quality verification; if quality problems exist, list possible solutions. Solutions to data quality problems generally depend heavily on both data and business knowledge.
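
The questions about missing values (how they are represented, where they occur, how common they are) and about errors can be partly answered with simple checks. The sketch below assumes a set of missing-value markers and a plausible age range of 0 to 120. Both are illustrative guesses that you would replace using knowledge of your own data and business context.

```python
from collections import Counter

# Assumed representations of "missing"; adjust to what your data actually uses.
MISSING_MARKERS = {"", "NA", "N/A", "null", "?"}

# Made-up records, kept as strings as they would arrive from a CSV file.
ROWS = [
    {"id": "1", "age": "34", "churned": "no"},
    {"id": "2", "age": "", "churned": "yes"},
    {"id": "3", "age": "NA", "churned": "no"},
    {"id": "4", "age": "-5", "churned": "no"},
]


def missing_report(rows):
    """For each field, count missing values by the marker used to represent them."""
    report = {}
    for row in rows:
        for name, value in row.items():
            counts = report.setdefault(name, Counter())
            if value.strip() in MISSING_MARKERS:
                counts[value.strip()] += 1
    return report


def out_of_range(rows, name, low, high, id_field="id"):
    """Return ids of rows whose value is non-numeric or outside [low, high]."""
    flagged = []
    for row in rows:
        value = row[name].strip()
        if value in MISSING_MARKERS:
            continue  # missing values are reported separately
        try:
            number = float(value)
        except ValueError:
            flagged.append(row[id_field])
            continue
        if not low <= number <= high:
            flagged.append(row[id_field])
    return flagged


missing = missing_report(ROWS)
missing_rate = {name: sum(c.values()) / len(ROWS) for name, c in missing.items()}
invalid_ages = out_of_range(ROWS, "age", 0, 120)

print(missing)
print(missing_rate)
print(invalid_ages)
```

The output tells you where problems occur and how common they are; deciding what to do about them (drop, impute, or query the data owner) still depends on data and business knowledge.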