Task:
For this assignment, you may work independently or in teams of no more than 4 students. If you are working on the project as a team, only one person needs to submit the assignment, so make sure to coordinate who submits it. If you choose to work in a group, there is one additional set of questions for the team, provided at the end.
Step 1: Select a dataset from one of the following sources: 1) the UCI Machine Learning Repository, https://archive.ics.uci.edu/ml/index.php, 2) Kaggle, https://www.kaggle.com/datasets, or 3) any dataset openly provided by an organization, preferably a non-profit, that could benefit from this analysis. [Note: many of the datasets on Kaggle link to the original dataset, so you do not have to set up a new Kaggle profile. Kaggle was acquired by Google in 2017.] The dataset you select must satisfy the following characteristics:
1. Must have a sample size of at least 500 observations
2. Must have at least two variables belonging to a legally recognized protected class
3. Must have at least two dependent variables (outcome variables) that could result in favorable or unfavorable outcomes [Note: Use your subjective opinion based on the discussions we’ve had in class]
4. Must be related to one of the regulated domains: Credit, Education, Employment, or Housing and ‘Public Accommodation’ [Note: Loosely, any dataset that could have potential bias in outcomes based on protected class membership is acceptable. Also, don’t be biased by how the dataset is labeled or organized; you can think creatively about how to structure the dataset so that it complies with the requirements]
Answer the following questions in the final project report:
• Which dataset did you select?
• Which regulated domain does your dataset belong to?
• How many observations are in the dataset?
• How many variables are in the dataset?
• Which variables did you select as your dependent variables?
• How many variables in the dataset are associated with a legally recognized protected class, and which ones are they? Which legal precedent/law (as discussed in the lectures) does each protected class fall under?
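The observation and variable counts asked for above can be read directly from the data file. The following is a minimal sketch using only the Python standard library; the inline CSV text and its column names (age, sex, credit_amount, approved) are made up for illustration and stand in for whatever dataset you actually select.

```python
import csv
import io

# Hypothetical CSV content standing in for a downloaded dataset file.
# Replace with open("your_file.csv", newline="") for a real dataset.
sample_csv = io.StringIO(
    "age,sex,credit_amount,approved\n"
    "23,female,1200,0\n"
    "45,male,5400,1\n"
    "67,male,800,0\n"
)

def dataset_shape(handle):
    """Return (observations, variables) for a CSV that has a header row."""
    reader = csv.reader(handle)
    header = next(reader)                     # first row holds variable names
    n_rows = sum(1 for row in reader if row)  # count non-blank data rows
    return n_rows, len(header)

MIN_OBSERVATIONS = 500  # sample-size requirement from the list above

n_obs, n_vars = dataset_shape(sample_csv)
print(f"observations={n_obs}, variables={n_vars}, "
      f"meets minimum={n_obs >= MIN_OBSERVATIONS}")
```

The toy sample is far too small to meet the 500-observation requirement; it only shows how the two counts might be obtained.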
1) Identify the members associated with your protected class variables and group together into a subset of membership categories as appropriate
2) Discretize the values associated with your dependent variables into discrete categories/numerical values as appropriate
3) Compute the frequency of each membership category associated with each of your protected class variables
4) Create a histogram for each protected class variable that graphs the frequency values of its membership categories as a function of the dependent variables
Provide the following in the final project report:
• Table documenting the relationship between members and membership categories for each protected class variable
• Table documenting the relationship between values and discrete categories/numerical values associated with your dependent variables
• Table providing the computed frequency values for the membership categories of each protected class variable
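Steps 1) through 4) can be sketched on a toy example as follows. This is one possible approach under stated assumptions, not a required implementation: the records, the age cutoff of 40, and the credit_amount threshold of 3000 are all hypothetical, and the text bars stand in for a plotted histogram.

```python
from collections import Counter

# Hypothetical records; the values are made up for illustration only.
records = [
    {"age": 23, "sex": "female", "credit_amount": 1200},
    {"age": 45, "sex": "male", "credit_amount": 5400},
    {"age": 67, "sex": "male", "credit_amount": 800},
    {"age": 31, "sex": "female", "credit_amount": 9100},
    {"age": 52, "sex": "female", "credit_amount": 3000},
    {"age": 19, "sex": "male", "credit_amount": 2500},
]

def age_category(age):
    """Step 1: map each member (a raw age) to a membership category.
    The cutoff of 40 is an assumption chosen for this sketch."""
    return "under 40" if age < 40 else "40 and over"

def outcome_category(amount, threshold=3000):
    """Step 2: discretize the dependent variable into two categories.
    The threshold is an assumption chosen for this sketch."""
    return "favorable" if amount >= threshold else "unfavorable"

# Step 3: frequency of each membership category.
age_freq = Counter(age_category(r["age"]) for r in records)

# Step 4: frequency of each (membership category, outcome) pair,
# which is the data a grouped histogram would display.
joint = Counter(
    (age_category(r["age"]), outcome_category(r["credit_amount"]))
    for r in records
)

def text_histogram(joint_counts):
    """Render one text bar per (membership category, outcome) pair."""
    return [
        f"{group:<12} {outcome:<12} {'#' * n} ({n})"
        for (group, outcome), n in sorted(joint_counts.items())
    ]

print(dict(age_freq))
for line in text_histogram(joint):
    print(line)
```

The same joint counts could be passed to a plotting library to draw the bars. The membership-category and discretization functions double as the mapping tables requested in the report.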