Part 1: Collect and Prepare Data
This project requires you to use the tools learned throughout this portion of the course to build a model for a real-world situation: predicting the success of NBA teams.
A tragedy has happened and the world has lost all copies of Win-Loss data for the 2018-2019 NBA season. All other stats remain, but we can no longer trust the Win and Loss counts. The only solution the world has is to use your advanced analysis skills to predict what those win totals should be; these win predictions will be written into the history books, so we need to be accurate!
Note: You do not need an advanced, or even a basic, knowledge of basketball or the NBA. This entire project can be completed without learning what these statistics mean in relation to an actual game, just as some of the class examples (wine quality, insurance) had inputs that we don’t fully understand. DO NOT spend your time trying to learn about basketball, or trying to apply basketball knowledge you already have to the assignment. It will not help; use the techniques we’ve talked about in class.
Note: Most of this should already have been completed in the chapter 5 assignment. Start with that sheet and make the edits indicated below.
For this model you will need to prepare two sets of data. The easiest way is to put each on its own sheet in the same workbook:
· Source Data – 5 (or more) seasons of data, from 2017-2018 and earlier. This will have all statistics from the table (including wins and playoff status), but with the win-proxy stats removed.
o Note: This model must not contain any win-like statistics such as losses, winning percentage, Pythagorean wins/losses, margin of victory, SRS, etc. Once you’ve created the sheet, you must remove the win-proxy stats; some of those are L, PW, PL, MOV, and SRS. There may be others if you’ve added optional data to your model. Double-check this, as including any win/loss statistics will ruin the accuracy of your model.
· Subject Data – The most recent season (2018-2019). This will have the same set of statistics as the source data, except that the wins and playoff columns should be blank, since these are what we are predicting. The classification model will predict playoff status, and the prediction model will predict the number of wins.
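The two data sets described above can also be sketched outside the workbook. The following is a minimal illustration in Python, not the required method for the assignment. It assumes each team-season is a dictionary, and the column names "Season", "Team", "W", "Playoffs", and "FG%" are assumptions chosen for illustration; only the win-proxy names L, PW, PL, MOV, and SRS come from the note above. The example rows are made-up placeholders, not real NBA statistics.

```python
# Win-proxy columns named in the note above; extend this set if you added optional data.
WIN_PROXY_COLUMNS = {"L", "PW", "PL", "MOV", "SRS"}
# Assumed column names for the two targets (wins and playoff status).
TARGET_COLUMNS = ("W", "Playoffs")
SUBJECT_SEASON = "2018-2019"


def drop_win_proxies(row):
    """Return a copy of one team-season row without the win-proxy columns."""
    return {name: value for name, value in row.items() if name not in WIN_PROXY_COLUMNS}


def build_source_and_subject(rows):
    """Split rows into source data (earlier seasons) and subject data (2018-2019)."""
    source, subject = [], []
    for row in rows:
        cleaned = drop_win_proxies(row)
        if cleaned["Season"] == SUBJECT_SEASON:
            # Blank the targets: these are the values the model must predict.
            for target in TARGET_COLUMNS:
                cleaned[target] = None
            subject.append(cleaned)
        else:
            source.append(cleaned)
    return source, subject


# Made-up placeholder rows, not real NBA statistics.
example_rows = [
    {"Season": "2017-2018", "Team": "Team A", "W": 50, "L": 32, "MOV": 3.1, "FG%": 0.47, "Playoffs": 1},
    {"Season": "2018-2019", "Team": "Team A", "W": 0, "L": 0, "MOV": 0.0, "FG%": 0.46, "Playoffs": 0},
]
source_rows, subject_rows = build_source_and_subject(example_rows)
```

Removing the win-proxy columns before splitting means both data sets end up with the same set of statistics, which is what the subject data description asks for.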
Once your data is prepared, you can begin creating your predictive model. Any and all of the tools we looked at in the course are available to you. As you build your model, you may find that data needs to be added to or removed from your initial worksheet. You may also choose to use other techniques, such as normalization and partitioning, to create a more accurate model.
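As a rough illustration of the two optional techniques just mentioned, here is a minimal Python sketch using only the standard library. It assumes z-score normalization and a random 70/30 training/validation split; both are assumptions, since the assignment does not prescribe a particular normalization method or split ratio. The function names are hypothetical.

```python
import random
import statistics


def fit_scaler(values):
    """Return the mean and population standard deviation of one column."""
    return statistics.mean(values), statistics.pstdev(values)


def apply_scaler(values, mean, stdev):
    """Z-score normalize values using a previously fitted mean and stdev."""
    if stdev == 0:
        return [0.0 for _ in values]  # a constant column carries no information
    return [(value - mean) / stdev for value in values]


def partition(rows, validation_fraction=0.3, seed=42):
    """Shuffle rows reproducibly and split them into training and validation sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]
```

Keeping the fitting step separate from the applying step lets the mean and standard deviation computed on the source data be reused on the subject data, so both are on the same scale.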
As you go through this process, you must take note of what method you are using, what changes you make to the model data, and why you are making those decisions. You will need to present both your model and the reasoning behind it: why you built it as you did and why it is superior to the alternatives that proved to be less accurate. The process of developing your model is the most important part of this project, so ensure you are making logical improvements and documenting the reasoning and impact.
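One way to back up a claim that one model is more accurate than another is to score each candidate with the same error measure on held-out source data. The sketch below assumes root mean squared error (RMSE) as that measure; the assignment does not name a specific metric, and the function names are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted win totals."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def rank_models(actual, predictions_by_model):
    """Return (model name, RMSE) pairs sorted from most to least accurate."""
    scores = [(name, rmse(actual, predicted)) for name, predicted in predictions_by_model.items()]
    return sorted(scores, key=lambda pair: pair[1])
```

Recording the ranked scores each time the model data changes gives a simple log of which changes helped and by how much.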
Note: Use the source data to build a predictive model targeted at predicting number of wins, then use that model to predict the number of wins on the subject data.
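The fit-on-source, predict-on-subject workflow can be sketched as follows. This is a deliberately simplified illustration that assumes a least-squares regression on a single input statistic; a real model for this project would likely use several inputs and whichever course tool you choose. The numbers are invented placeholders, not NBA data.

```python
def fit_simple_regression(xs, ys):
    """Fit wins = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("the input statistic must vary across rows")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, xs):
    """Apply the fitted line to new input values."""
    return [intercept + slope * x for x in xs]


# Invented placeholder values: one input statistic and known wins from the source data.
source_stat = [1.0, 2.0, 3.0, 4.0]
source_wins = [3.0, 5.0, 7.0, 9.0]
# The same statistic for the subject data, where wins are blank.
subject_stat = [5.0]

slope, intercept = fit_simple_regression(source_stat, source_wins)
subject_predicted_wins = predict(slope, intercept, subject_stat)
```

The model's parameters are estimated only from the source data; the subject data is used only as input to the already fitted model.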