This report details the analysis, the processes, the decisions taken, and the algorithms implemented to clean up data in Python. Data cleaning is an integral part of data analysis and analytics: to analyze any scientific data effectively, thorough cleaning has to be done to eliminate or correct erroneous, malformed, or missing data. The first part of this exercise entails cleaning up data to eliminate lexical errors, irregularities, violations of integrity constraints, and inconsistencies.
Task 1: Auditing and Cleansing the Job dataset
The dataset was first audited manually, by observation, to identify the most prevalent types of error. To clean the data effectively, each column of the dataset was examined and the identified errors fixed to meet that column's requirements. A second auditing step was performed with various Python commands before the cleaning code was written.
An inspection of the columns revealed the first issue: inconsistent column names. While the other column names followed a consistent capitalization style, the 'Salary per annum' column name contained words whose first characters were not capitalized.
Figure 2.0 Fixing column names inconsistencies
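The rename can be sketched as follows, assuming the data is loaded into a pandas DataFrame; the column names shown are illustrative:

```python
import pandas as pd

# Illustrative frame mirroring the inconsistency described: the
# 'salary per annum' header is not capitalized like the others.
df = pd.DataFrame(columns=["Title", "Company", "salary per annum"])

# Rename the inconsistent column so all headers follow one style.
df = df.rename(columns={"salary per annum": "Salary per annum"})
```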
While auditing the data in the columns, three columns were identified as having missing values, lexical errors, irregularities, inconsistencies, and violations of integrity constraints. The columns identified were:
Title: the column contained the title of the job position. The most prevalent error found was the presence of non-ASCII characters, and cleaning this column entailed removing those characters.
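A minimal sketch of stripping non-ASCII characters from the Title column (the sample titles are invented):

```python
import pandas as pd

# Invented sample titles containing non-ASCII characters.
df = pd.DataFrame({"Title": ["Caf\u00e9 Manager", "Nurse \u2013 Night Shift"]})

# Encode to ASCII, dropping any character outside the ASCII range,
# then decode back to plain strings.
df["Title"] = df["Title"].str.encode("ascii", errors="ignore").str.decode("ascii")
```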
Company: the column had numerous missing entries. An analysis showed that the jobs with a missing company were Healthcare & Nursing Jobs sourced from the careworx.co.uk website; the company recruiting these healthcare workers was therefore CareWorx, so all entries in the "Healthcare & Nursing Jobs" category sourced from careworx.co.uk that lacked a company name were filled with "CareWorx". Other entries without a company name were found to be mostly sourced directly from companies that were likely recruiting their own staff, so those were filled with the first label of the source domain name.
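The two fill rules can be sketched as below; the column names ('Company', 'Category', 'SourceName') and the sample rows are assumptions for illustration:

```python
import pandas as pd

# Invented rows illustrating the two fill rules.
df = pd.DataFrame({
    "Company":    [None, "Acme Ltd", None],
    "Category":   ["Healthcare & Nursing Jobs", "IT Jobs", "IT Jobs"],
    "SourceName": ["careworx.co.uk", "jobs.acme.com", "techhire.co.uk"],
})

# Rule 1: missing Healthcare & Nursing entries sourced from
# careworx.co.uk were recruited by CareWorx.
mask = (
    df["Company"].isna()
    & (df["Category"] == "Healthcare & Nursing Jobs")
    & (df["SourceName"] == "careworx.co.uk")
)
df.loc[mask, "Company"] = "CareWorx"

# Rule 2: any remaining missing names take the first label of the
# source domain (e.g. 'techhire.co.uk' -> 'techhire').
missing = df["Company"].isna()
df.loc[missing, "Company"] = df.loc[missing, "SourceName"].str.split(".").str[0]
```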
Salary per annum: the column was found to have numerous data-entry inconsistencies, with some records using the letter "K" to indicate 'thousand'. Cleaning this column required replacing "K" with three zeros to make the entry a valid number. Where an entry was given as a range, it was replaced with the average salary for that range.
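A sketch of this normalization, with invented sample values; the helper name `clean_salary` is hypothetical:

```python
import pandas as pd

def clean_salary(raw):
    """Normalize a raw salary entry to a single annual number."""
    s = str(raw).upper().replace(",", "").strip()
    s = s.replace("K", "000")              # '25K' -> '25000'
    if "-" in s:                           # a range -> average of its bounds
        low, high = (float(part) for part in s.split("-"))
        return (low + high) / 2
    return float(s)

salaries = pd.Series(["25K", "30000", "20K-30K"])
cleaned = salaries.map(clean_salary)
```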
Task 2: Integrating the Job datasets
Inspecting the data frames for conflicts with a df.columns command shows that both datasets share common columns. When merging data, conflicts would occur if the columns did not match, especially where one schema has more columns than the other; in this case, both schemas have the same number of columns.
A potential source of conflict is merging before the column name inconsistencies in dataset 2 are fixed, so the dataset was cleaned first to avoid this. Since the intended merge combines all records from both datasets, an "outer" merge was used.
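An outer merge of two frames sharing the same schema can be sketched with toy data:

```python
import pandas as pd

# Toy frames with identical schemas, standing in for the two datasets.
df1 = pd.DataFrame({"Id": [1, 2], "Title": ["Nurse", "Developer"]})
df2 = pd.DataFrame({"Id": [3], "Title": ["Analyst"]})

# With no 'on' argument, merge joins on all shared columns; how='outer'
# keeps every record from both frames.
combined = df1.merge(df2, how="outer")
```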
Task 3: Finding missing values and filling in reasonable values
The missing values in the data frame, represented by NaN, were replaced using the interpolate() function. This method was chosen because interpolating the data gave the closest and most realistic values.
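A minimal sketch of the fill, using a toy column; interpolate() defaults to linear interpolation between the neighbouring known values:

```python
import numpy as np
import pandas as pd

# Toy series with a missing price between two known neighbours.
df = pd.DataFrame({"price": [100.0, np.nan, 300.0]})

# interpolate() fills the NaN linearly between its neighbours.
df["price"] = df["price"].interpolate()
```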
Task 4: Finding the outliers
To ease the process of identifying outliers, a number of approaches were used. The first step was to obtain summary statistics for all the data; from these, price and sqft_living were two variables that showed potential outliers.
Figure 4.0 Boxplot grouped by sqft_living against price, showing presence of some outliers
To automate the removal of outliers, the standard deviation method was used: data points lying more than a chosen number of standard deviations from the mean are filtered out. The outliers were found to deviate steeply from the other data points, so the standard deviation method is well suited to eliminating them.
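The filter can be sketched as below on toy prices; the threshold of two standard deviations is an illustrative choice, since a single extreme value in a small sample inflates the standard deviation enough that a three-sigma cutoff may miss it:

```python
import pandas as pd

# Toy prices: nine typical values and one extreme outlier.
df = pd.DataFrame({"price": [95, 100, 102, 98, 105, 97, 103, 99, 101, 5000]})

# Keep only rows within two standard deviations of the mean.
mean, std = df["price"].mean(), df["price"].std()
filtered = df[(df["price"] - mean).abs() <= 2 * std]
```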