- demonstrate applied knowledge of people, markets, finances, technology and management in a global context of business intelligence practice (data warehouse design, data mining process, data visualisation and performance management) and resulting organisational change and how these apply to implementation of business intelligence in organisation systems and business processes
- identify and solve complex organisational problems creatively and practically through the use of business intelligence and critically reflect on how evidence based decision making and sustainable business performance management can effectively address real world problems
- demonstrate the ability to communicate effectively in a clear and concise manner in written report style for senior management with correct and appropriate acknowledgment of main ideas presented and discussed.
Task 1.1) Conduct an exploratory data analysis (EDA) of the house-prices.csv data set using the RapidMiner Studio data mining tool. Provide the following for Task 1.1:
(i) a screen capture of your final EDA process and briefly describe your EDA process
(ii) summarise key results of your exploratory data analysis in a table named Table 1.1 Results of Exploratory Data Analysis for House-Prices.csv.
(iii) Discuss the key results of your exploratory data analysis presented in Table 1.1 and provide a rationale for why you have selected your 5-6 top variables for predicting house prices and in particular their relationship with the house price drawing on the results of your EDA analysis and relevant literature (About 500 words).
Table 1.1 should include the key characteristics of each variable in the house-prices.csv data set such as maximum, minimum values, average, standard deviation, most frequent values (mode), missing values and invalid values etc.
Hint: The Statistics Tab and the Chart Tab in RapidMiner provide a lot of descriptive statistical information and the ability to create useful charts like Barcharts, Scatterplots etc for the EDA analysis. You might also like to look at running some correlations and chi square tests on the house-prices.csv data set to indicate which variables you consider to be the top 5-6 key variables which contribute most to predicting house prices.
Task 1.2) Build a Linear Regression model for predicting house price using a RapidMiner data mining process, an appropriate set of data mining operators and a reduced set of variables from the house-prices.csv data set determined by your exploratory data analysis in Task 1.1. Provide the following for Task 1.2:
(i) A screen capture of Final Linear Regression Model process and briefly describe your Final Linear Regression Model process
(ii) A table named Table 1.2 Results of Final Linear Regression Model for Task 1.2 for house-prices.csv data set.
(iii) Discuss the results of the Final Linear Regression Model for house-prices.csv data set drawing on the key outputs (coefficients, standardised coefficients, t-statistics values, p-values and significance levels etc) for predicting house prices and relevant supporting literature on the interpretation of a Linear Regression Model (About 500 words).
Exploratory Data Analysis
Exploratory data analysis (EDA) was performed on the given data set in RapidMiner (Andrews, Sanchez & Johansson, 2011). In the EDA process, the data file was connected to the Select Attributes operator to exclude variables such as id and date because of their nominal and ordinal nature. The zip code variable was also excluded because its descriptive statistics were redundant. The Select Attributes operator was then connected to the result port of the process window to obtain the descriptive statistics of all other variables (Iacoviello & Neri, 2010).
- Summarization of key results of exploratory data analysis:
Table 1.1: Summary of EDA values for task 1. (i)
Field name | Maximum | Minimum | Missing values | Mean | Standard deviation | Mode
price | $7700000.0 | $75000.0 | 0 | $540088.14 | $367127.19 | $450000
bedrooms | 33.0 | 0.0 | 0 | 3.37 | 0.93 | 3
bathrooms | 8.0 | 0.0 | 0 | 2.11 | 0.77 | 2.5
sqft_living | 13540.0 | 290.0 | 0 | 2079.89 | 918.44 | 1300
sqft_lot | 1651359.0 | 520.0 | 0 | 15106.96 | 41420.51 | 5000
floors | 3.5 | 1.0 | 0 | 1.49 | 0.53 | 1
waterfront | 1.0 | 0.0 | 0 | 0.01 | 0.08 | 0
view | 4.0 | 0.0 | 0 | 0.23 | 0.76 | 0
condition | 5.0 | 1.0 | 0 | 3.40 | 0.65 | 3
grade | 13.0 | 1.0 | 0 | 7.65 | 1.17 | 7
sqft_above | 9410.0 | 290.0 | 0 | 1788.39 | 828.09 | 1300
sqft_basement | 4820.0 | 0.0 | 0 | 291.50 | 442.57 | 0
sqft_living15 | 6210.0 | 399.0 | 0 | 1986.55 | 685.39 | 1540
sqft_lot15 | 871200.0 | 651.0 | 0 | 12768.45 | 27304.17 | 5000
lat | 47.77 | 47.15 | 0 | 47.56 | 0.139 |
long | -121.31 | -122.52 | 0 | -122.21 | 0.141 |
Exploratory data analysis found a mean house price of $540088.14 with a standard deviation (S.D.) of $367127.19. The average numbers of bedrooms and bathrooms were 3.37 (S.D. = 0.93) and 2.11 (S.D. = 0.77) respectively. The results indicate that houses with three bedrooms and around two bathrooms were the most common among those sold in 2014-2015 in the United States. The average living area was 2079.89 square feet (S.D. = 918.44), and the most frequent living area was 1300 square feet (Mohit, Ibrahim & Rashid, 2010). The analysis also showed that very few houses had a waterfront or a view (means of 0.01 and 0.23 respectively). The mean grade was 7.65 (S.D. = 1.17), and the houses were located on average at latitude 47.56 (S.D. = 0.14) and longitude -122.21 (S.D. = 0.14).
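The descriptive statistics reported above can also be reproduced outside RapidMiner. The following is a minimal Python sketch, assuming missing entries are encoded as None and that the sample (n - 1) standard deviation is wanted; the function name summarise and the small bedrooms list are illustrative only and are not taken from house-prices.csv.

```python
import statistics


def summarise(values):
    """Return summary statistics for one numeric column.

    Assumption: missing entries are represented as None. The real file
    may encode them differently, so adjust the filter as needed.
    """
    present = [v for v in values if v is not None]
    return {
        "max": max(present),
        "min": min(present),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "std": statistics.stdev(present),  # sample standard deviation (n - 1)
        "mode": statistics.mode(present),
    }


# Made-up column for illustration, NOT rows from house-prices.csv.
bedrooms = [3, 3, 4, 2, None, 3]
summary = summarise(bedrooms)
print(summary)
```

Running the same function over every retained column would give one row of the summary table per variable.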
(iii) The number of bathrooms, the total square feet of living space (sqft_living), the location of the house (latitude), the grade of the materials used for construction (grade) and the year built (yr_built) were the main variables used for predicting house prices in the model. Pearson's correlation coefficients were computed between all the relevant variables in the EDA process. Price was found to be significantly correlated with the above variables, although the coefficient for yr_built (0.054) was small. The correlation analysis also found view and floors to be significantly correlated with price.
Table 1.11: Correlation coefficient values for major predicting variables
Correlation values | price | grade | bathrooms | sqft_living | lat | Yr_built
price | 1 | 0.667 | 0.525 | 0.702 | 0.307 | 0.054
The correlation analysis was performed in RapidMiner, and the process window is shown in figure 2.
The correlation matrix from RapidMiner is provided in figure 3.
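For readers who want to check a coefficient by hand, Pearson's correlation is the covariance of two variables divided by the product of their standard deviations. The sketch below is a plain Python illustration under that definition; the function name pearson and the sample values are hypothetical and are not drawn from the data set.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Illustrative values only: a perfectly linear relationship.
sqft_living = [1000, 1500, 2000, 2500]
price = [200000, 300000, 400000, 500000]
r_up = pearson(sqft_living, price)
r_down = pearson(sqft_living, list(reversed(price)))
print(r_up, r_down)
```

A value near 1 or -1 indicates a strong linear relationship, while a value near 0, such as the 0.054 reported for Yr_built, indicates a weak one.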
Price was significantly correlated with floors and view, but the scatter plots from the exploratory data analysis (figures 4 and 5) showed that these variables did not explain or predict house prices well enough to be retained.
A chi-square test was also used to check the validity of the data provided for price prediction. Price was taken as the dependent (label) variable in the chi-square weighting process. The chi-square weights were 638.35 for bathrooms, 230.78 for total square feet of living space, 329.88 for grade, 7652.40 for latitude and 2095.85 for year built. These variables were also found to be significantly correlated with price in the Pearson correlation test and in the scatter diagrams from the EDA results. The confirmatory tests therefore supported the selection of the five major factors. Square feet above and square feet available were also significantly correlated with price, but they were left out of the study because square feet of living space was already a major decision variable.
Chung Chun Lin approached the prediction of real estate prices with multiple regression and a non-parametric model in 2013. Shiau Hui Kok, a professor in the Department of Economics, Universiti Putra Malaysia, also analysed the factors predicting real estate prices in Malaysia from 2002 to 2015. Macroeconomic theories were used to predict the price of real estate. In this study, external effects such as the GDP per capita of a country and the purchasing power of residents by year and place were not considered for predicting house prices. The price of a house always depends on its location. In this study, the latitude and longitude of the properties in the United States were found to be correlated with price. In 2010, Mohammad Abdul Mohit and Mansor Ibrahim in Malaysia found that low-cost housing with a proper social environment and nearby facilities was a very popular and lucrative option for buyers. Correlation and multiple regression models were used to show that the social environment had a low level of association with nearby facilities and that customers were more satisfied with accommodation facilities than with environmental facilities. The study variables were chosen on the basis of the given data, in line with this earlier work.
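The chi-square weights discussed above measure how far the observed counts in a cross-tabulation depart from the counts expected if the attribute and the label were unrelated. The sketch below shows that calculation for a small contingency table. It assumes the numeric variables have already been binned (for example into low and high groups), and it is an illustration of the statistic rather than a reproduction of RapidMiner's internal procedure; the counts are made up.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table given as rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical counts: rows are bathroom bins (few, many),
# columns are price bins (low, high). Not taken from the data set.
associated = [[30, 10], [10, 30]]
unrelated = [[10, 10], [10, 10]]
print(chi_square(associated), chi_square(unrelated))
```

A larger statistic indicates a stronger departure from independence, which is why attributes with higher weights are treated as more relevant to the label.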
Linear Regression Model
(i) The linear regression model was set up with five decision variables to observe their effect on house price. The scatter diagrams (figures 9-12) of price versus the decision variables were studied before the regression model was run. A significant degree of association was observed between price and the five decision variables. The regression process in RapidMiner was constructed as shown in figure 8.
The regression coefficients and significance values are stated in table 1.2 (from the figure in the appendix). The p-values (significance values) were all reported as zero, which again supported the significance of the five decision variables. The grade of the materials (122655.28) and the location (latitude) (527912.45) of the houses were found to have the largest positive coefficients, whereas the year built, or age, of the house had a large negative coefficient (-3300.84).
A house sits in a physical location, and its surroundings and neighbourhood facilities are key aspects of its value. If an area is close to business or market districts, house prices are higher than for comparable houses in other neighbourhoods. The larger the floor area of the house, the more expensive it can be. The number of bedrooms also largely influences a home's value, so a house with several bedrooms is more likely to have high appeal than one with just one bedroom. A house in a rural or less developed area will generally cost less than one in a well developed or urban area. In addition, an area with good access to highways, schools, shopping centres and local business opportunities adds to the value of a house. Hence the results of the study were in line with previous observations and theories. Mohammad Abdul Mohit and Mansor Ibrahim in Malaysia observed this relationship in 2010 using multiple regression analysis.
Table 1.2: Results of Final Linear Regression model for task 1.2
Attribute | Coefficient | Std.error | t-statistic | p-value
grade | 122655.284 | 2124.72 | 57.72 | 0
bathrooms | 37074.53 | 3281.31 | 11.30 | 0
sqft_living | 166.82 | 3.03 | 55.114 | 0
lat | 527912.45 | 11159.32 | 47.31 | 0
Yr_built | -3300.837 | 63.31 | -52.14 | 0
intercept | -19426027.44 | 568218.49 | -34.19 | 0
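The intercept was large and negative, with a value of -19426027.44. It represents the predicted price when all five predictors equal zero, a combination far outside the observed data (for example, a latitude of zero and a year built of zero), so it should not be read as a realistic house price. The intercept was highly significant, with a reported p-value of zero.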
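The fitted equation implied by the coefficients reported above can be written out directly, which helps show how the large negative intercept is offset by the latitude and grade terms. The following is a minimal sketch: the coefficients are copied from the regression results in this report, while the function name predict_price and the example house are hypothetical and do not correspond to a row in the data set.

```python
# Coefficients copied from the regression results reported in this section.
COEFFICIENTS = {
    "grade": 122655.284,
    "bathrooms": 37074.53,
    "sqft_living": 166.82,
    "lat": 527912.45,
    "yr_built": -3300.837,
}
INTERCEPT = -19426027.44


def predict_price(house):
    """Evaluate the fitted linear equation for one house (dict of predictors)."""
    return INTERCEPT + sum(COEFFICIENTS[name] * house[name] for name in COEFFICIENTS)


# Hypothetical house with plausible values; not a row from house-prices.csv.
example = {"grade": 7, "bathrooms": 2, "sqft_living": 2000, "lat": 47.56, "yr_built": 1990}
baseline = predict_price(example)

# Raising grade by one unit, with everything else held fixed, changes the
# prediction by the grade coefficient.
upgraded = predict_price({**example, "grade": 8})
print(round(baseline, 2), round(upgraded - baseline, 2))
```

Each coefficient is therefore the change in predicted price for a one-unit change in that variable with the others held constant, so coefficients on different scales (for example latitude in degrees and living area in square feet) are not directly comparable in size.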
The t-statistics for all five predictors in the regression analysis led to rejection of the null hypothesis that price is not associated with them. The t-statistics for all the variables (table 1.2) fell in the rejection region at the 5% level of significance. The alternative hypothesis of a significant association between the decision variables and price was therefore supported.
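Each t-statistic is the coefficient divided by its standard error, so the reported values can be checked from the coefficient and standard error columns. The sketch below does this under one stated assumption: the two-sided 5% critical value is taken as 1.96, the large-sample normal approximation, because the sample size is not restated here. Small differences from the reported t-statistics are expected because the standard errors are rounded.

```python
# (coefficient, standard error) pairs copied from the regression results.
RESULTS = {
    "grade": (122655.284, 2124.72),
    "bathrooms": (37074.53, 3281.31),
    "sqft_living": (166.82, 3.03),
    "lat": (527912.45, 11159.32),
    "yr_built": (-3300.837, 63.31),
    "intercept": (-19426027.44, 568218.49),
}

# Assumption: large-sample two-sided critical value at the 5% level.
CRITICAL_VALUE = 1.96

t_stats = {name: coef / se for name, (coef, se) in RESULTS.items()}
significant = {name: abs(t) > CRITICAL_VALUE for name, t in t_stats.items()}

for name, t in t_stats.items():
    print(f"{name}: t = {t:.2f}, significant at 5%: {significant[name]}")
```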
Price with number of bathrooms-Rapid Miner result
2.1 Tableau Desktop View of House Prices 2014 to 2015
Quarterly price variation of housing
Tableau Desktop version 10.5 was used to create the graph view of the data. The data file was connected to Tableau, and in a new sheet price (a measure) was placed on the vertical axis with year (a dimension) on the horizontal axis. Year was expanded to quarters to show the trend. The price of houses varied from quarter to quarter over the given time frame. A sharp decline in prices was observed in the third and fourth quarters of 2014. Prices climbed again in the first quarter of 2015 and dipped sharply in the second quarter of 2015.
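The quarterly view described above amounts to grouping sales by calendar quarter and aggregating price within each group. The sketch below illustrates the idea in Python. It assumes the average is the aggregation of interest (the aggregation used in Tableau is not stated) and that sale dates have already been parsed into date objects; the function names and the sales records are made up for illustration.

```python
from collections import defaultdict
from datetime import date


def quarter_label(d):
    """Map a date to a label such as '2014 Q2'."""
    return f"{d.year} Q{(d.month - 1) // 3 + 1}"


def mean_price_by_quarter(sales):
    """Average price per quarter for an iterable of (sale_date, price) pairs."""
    totals = defaultdict(lambda: [0.0, 0])  # quarter -> [price sum, count]
    for sold_on, price in sales:
        bucket = totals[quarter_label(sold_on)]
        bucket[0] += price
        bucket[1] += 1
    return {q: s / n for q, (s, n) in sorted(totals.items())}


# Made-up sales records, NOT taken from house-prices.csv.
sales = [
    (date(2014, 5, 2), 400000),
    (date(2014, 6, 20), 500000),
    (date(2014, 8, 1), 450000),
    (date(2015, 1, 15), 600000),
]
quarterly = mean_price_by_quarter(sales)
print(quarterly)
```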
Results of Exploratory Data Analysis
Tableau was used to build graphical representations for analysing the degree of association between the five decision variables and price. Line diagrams of these relationships are provided in figures 14-17. The labels on the graphs were added by dragging the measure variables onto the Label button in the Marks section of the worksheet.
Price versus square feet for living-Tableau desktop version
In the line diagram of price versus square feet of living space (Sqft_living), the labels were based on the total Sqft_living sold in each period from the second quarter of 2014 to the second quarter of 2015.
In the price versus bathrooms line diagram in figure 15, the labels were based on the total number of bathrooms sold with houses over the given time frame (Vahlne & Johanson, 2017).
Price with number of bathrooms -Desktop Tableau version
The grades assigned to the houses by the government or reputable local agencies were plotted against price, using the average grade. Lower average grades were observed in 2015 than in 2014.
The plot of latitude versus price showed that the properties sold in 2015 were, on average, located slightly further south, whereas in 2014 comparatively more properties in the north were sold.
Price with latitude-Desktop Tableau version
2.2 Geo map
Geo map views for all the variables were created in Tableau Desktop by selecting latitude and longitude from the measures. These fields were generated by the software itself based on the data (Agnello & Schuknecht, 2011). After the zip code field was assigned the zip code geographic role, the generated longitude and latitude were placed on columns and rows, producing a map of the world with seventy null values. The location of the map was changed to the USA based on the zip codes, and the required geo map was obtained (Kok, Ismail & Lee, 2018). Figure 18 is the geo map of the sum of square feet of living space in different locations in the United States.
A second geo map was constructed for the number of bathrooms sold in different locations of the USA. The maximum number of bathrooms sold was used for the labels in the graph (figure 19) (Dooley & Hutchison, 2009).
The grade of construction was also plotted on a geo map for different zip codes of the USA. The sum of the grades was used to select the colours for the map (Chun Lin & Mohan, 2011). The average grade was used for the labels of the geo map. The average grade was much higher in the northern part than in the southern part of the mapped area (figure 20).
A geo map for year built was also created in Tableau Desktop across different locations in the USA from the given data set (figure 21). The labels were set to year built from the dimension section of the worksheet.
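The measures behind these geo maps are aggregations per zip code: a sum for living area, a maximum for bathrooms and an average for grade. The sketch below shows one way to compute them in Python. The function name, the field names zipcode, sqft_living, bathrooms and grade as dictionary keys, and the placeholder zip codes are all illustrative assumptions, and the records are not taken from the data set.

```python
from collections import defaultdict


def aggregate_by_zip(houses):
    """Group house records by zip code and compute the map measures."""
    groups = defaultdict(list)
    for house in houses:
        groups[house["zipcode"]].append(house)
    return {
        zipcode: {
            "sum_sqft_living": sum(h["sqft_living"] for h in rows),
            "max_bathrooms": max(h["bathrooms"] for h in rows),
            "avg_grade": sum(h["grade"] for h in rows) / len(rows),
        }
        for zipcode, rows in groups.items()
    }


# Made-up records with placeholder zip codes, for illustration only.
houses = [
    {"zipcode": "zip_a", "sqft_living": 1200, "bathrooms": 1.5, "grade": 7},
    {"zipcode": "zip_a", "sqft_living": 1800, "bathrooms": 2.5, "grade": 8},
    {"zipcode": "zip_b", "sqft_living": 2000, "bathrooms": 2.0, "grade": 6},
]
by_zip = aggregate_by_zip(houses)
print(by_zip)
```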
References
Agnello, L. and Schuknecht, L., 2011. Booms and busts in housing markets: determinants and implications. Journal of Housing Economics, 20(3), pp.171-190.
Andrews, D., Sanchez, A.C. and Johansson, Å., 2011. Housing markets and structural policies in OECD countries. OECD Economic Department Working Papers, (836), p.0_1.
Chun Lin, C. and Mohan, S.B., 2011. Effectiveness comparison of the residential property mass appraisal methodologies in the USA. International Journal of Housing Markets and Analysis, 4(3), pp.224-243.
Dooley, M. and Hutchison, M., 2009. Transmission of the US subprime crisis to emerging markets: Evidence on the decoupling–recoupling hypothesis. Journal of International Money and Finance, 28(8), pp.1331-1349.
Iacoviello, M. and Neri, S., 2010. Housing market spillovers: evidence from an estimated DSGE model. American Economic Journal: Macroeconomics, 2(2), pp.125-64.
Kok, S.H., Ismail, N.W. and Lee, C., 2018. The sources of house price changes in Malaysia. International Journal of Housing Markets and Analysis.
Mohit, M.A., Ibrahim, M. and Rashid, Y.R., 2010. Assessment of residential satisfaction in newly designed public low-cost housing in Kuala Lumpur, Malaysia. Habitat international, 34(1), pp.18-27.
Vahlne, J.E. and Johanson, J., 2017. The internationalization process of the firm—a model of knowledge development and increasing foreign market commitments. In International Business (pp. 145-154). Routledge.