Assignment Data Analysis: Exploring, Cleansing, and Modeling

Selecting the Data Set

1. Select a data set from those available (or from those designated by your instructor or found in other sources). Provide a thorough description of the data set, including the number of cases, the inputs, the target variable, and the variables that could be used to develop predictive models. Because this assignment uses a classification approach, select a data set with a binary target. Again, the target variable should be heavily skewed (at least 75% of the target values are of one type). The data set should be different from any used for logistic regression and should contain at least 1,500 cases.

NOTE: Predictive models are generally easier to develop well from larger data sets that offer many cases and many candidate inputs to select from. Part of your grade for this assignment will be based on the robustness of the data set used.
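The data set requirements in step 1 (binary target, at least 1,500 cases, at least 75% of target values of one type) can be checked before any modeling starts. The following is a minimal sketch in plain Python, not part of the SAS Enterprise Miner workflow the assignment uses; the function name `check_target` and the example target column are hypothetical and exist only for illustration.

```python
from collections import Counter


def check_target(target_values, min_cases=1500, min_majority_share=0.75):
    """Summarize a target column against the data set requirements in step 1."""
    counts = Counter(target_values)
    n_cases = len(target_values)
    # Share of cases that belong to the most frequent target value.
    majority_share = max(counts.values()) / n_cases if n_cases else 0.0
    return {
        "n_cases": n_cases,
        "class_counts": dict(counts),
        "is_binary": len(counts) == 2,
        "majority_share": majority_share,
        "enough_cases": n_cases >= min_cases,
        "skewed_enough": majority_share >= min_majority_share,
    }


# Hypothetical target column: 1,300 non-events (0) and 300 events (1).
example_target = [0] * 1300 + [1] * 300
summary = check_target(example_target)
print(summary)
```

Reporting these counts in the data set description also documents the degree of imbalance that step 4 has to address.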

2. Explore the data by searching for anticipated relationships, unanticipated trends, and anomalies in order to gain deeper understanding and ideas. Use the SEMMA Explore option to examine the data set you have created and look for interesting anomalies or relationships.

3. Cleanse and modify the data by removing errors, imputing missing values (as appropriate), transforming the variable distributions as necessary, and creating and selecting appropriate variables. Use the appropriate SEMMA options to cleanse the data set as necessary. Investigate and discuss any "feature engineering" done for the data set.
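Two of the cleansing operations named in step 3, imputing missing values and transforming a skewed distribution, can be illustrated with a small example. This is a sketch in plain Python under stated assumptions: missing values are represented as `None`, median imputation and a log(1 + x) transform are chosen only as common examples (the assignment does not prescribe them), and the `income` values are made up.

```python
import math
from statistics import median


def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def log_transform(values):
    """Apply log(1 + x) to compress a right-skewed, non-negative variable."""
    return [math.log1p(v) for v in values]


# Hypothetical input with two missing values and one extreme value.
income = [30, 45, None, 52, 1000, None, 38]
imputed = impute_median(income)
transformed = log_transform(imputed)
print(imputed)
print([round(v, 3) for v in transformed])
```

The median is used here rather than the mean because it is less affected by the extreme value, which is the kind of choice worth justifying when discussing the cleansing decisions.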

4. Develop complete predictive models using the appropriate predictive modeling technique. At least two models should be developed, compared, and explained. The imbalanced target variable must be addressed and accounted for using one or more of the methods outlined in earlier lessons.
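Step 4 refers to methods from earlier lessons for handling the imbalanced target; those lessons are not reproduced here, so the sketch below assumes two commonly used options, inverse-frequency class weights and random undersampling of the majority class. It is written in plain Python for illustration, and the function names and example data are hypothetical.

```python
import random
from collections import Counter


def class_weights(targets):
    """Inverse-frequency weights: n_cases / (n_classes * class_count)."""
    counts = Counter(targets)
    n, k = len(targets), len(counts)
    return {label: n / (k * count) for label, count in counts.items()}


def undersample_majority(rows, targets, seed=0):
    """Randomly drop rows so every class keeps only as many cases as the rarest class."""
    counts = Counter(targets)
    minority_size = min(counts.values())
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    kept = []
    for label in counts:
        idx = [i for i, t in enumerate(targets) if t == label]
        kept.extend(rng.sample(idx, minority_size))
    kept.sort()  # preserve the original row order
    return [rows[i] for i in kept], [targets[i] for i in kept]


# Hypothetical data: 80 non-events and 20 events.
targets = [0] * 80 + [1] * 20
rows = list(range(100))

weights = class_weights(targets)
balanced_rows, balanced_targets = undersample_majority(rows, targets)
print(weights)
print(Counter(balanced_targets))
```

Resampling of this kind is normally applied to the training partition only, so that validation and test results still reflect the original class proportions.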

5. Using appropriate accuracy measures, assess the resultant models. Provide a complete assessment of the different models created using the appropriate SAS Enterprise Miner model assessment options. Explain clearly any insights or conclusions drawn from the accuracy measures.
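With a heavily skewed target, overall accuracy alone can be misleading, which is why step 5 asks for appropriate accuracy measures. The sketch below, in plain Python with hypothetical names and made-up predictions, computes several measures from the confusion matrix and compares a model that always predicts the majority class with one that finds most of the rare events.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (tp, fp, fn, tn) for a binary target."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    return tp, fp, fn, tn


def assess(actual, predicted, positive=1):
    """Compute common classification measures from the confusion matrix."""
    tp, fp, fn, tn = confusion_counts(actual, predicted, positive)

    def ratio(num, den):
        return num / den if den else 0.0  # avoid division by zero

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Hypothetical validation set: 80 non-events followed by 20 events.
actual = [0] * 80 + [1] * 20
always_negative = [0] * 100                          # ignores the rare class
model_b = [1] * 10 + [0] * 70 + [1] * 15 + [0] * 5   # 10 false alarms, 15 hits

baseline = assess(actual, always_negative)
result_b = assess(actual, model_b)
print(baseline)
print(result_b)
```

In this made-up example the always-negative model reaches 80% accuracy while detecting none of the events, so recall, precision, and specificity give a more useful basis for comparing the models.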

6. Conclusions and takeaways. Provide clear and concise conclusions about the project, including lessons learned, and suggest improvements and enhancements for future analysis.