"Bike analysis with Random Forest & Boosting."

Objective: to demonstrate an understanding of the theory and practice of scalable distributed data analysis.

Results from the bike sharing data are shown in the first section. A random forest model and a gradient boosting regressor model were fitted to this data set, and the results are shown below.

Types of the data set used in the study

(17379, 17)

instant         int64
dteday         object
season          int64
yr              int64
mnth            int64
hr              int64
holiday         int64
weekday         int64
workingday      int64
weathersit      int64
temp          float64
atemp         float64
hum           float64
windspeed     float64
casual          int64
registered      int64
cnt             int64
dtype: object

Descriptive statistics of the data set:

          instant        season            yr          mnth            hr
count  17379.0000  17379.000000  17379.000000  17379.000000  17379.000000
mean    8690.0000      2.501640      0.502561      6.537775     11.546752
std     5017.0295      1.106918      0.500008      3.438776      6.914405
min        1.0000      1.000000      0.000000      1.000000      0.000000
25%     4345.5000      2.000000      0.000000      4.000000      6.000000
50%     8690.0000      3.000000      1.000000      7.000000     12.000000
75%    13034.5000      3.000000      1.000000     10.000000     18.000000
max    17379.0000      4.000000      1.000000     12.000000     23.000000

            holiday       weekday    workingday    weathersit          temp
count  17379.000000  17379.000000  17379.000000  17379.000000  17379.000000
mean       0.028770      3.003683      0.682721      1.425283      0.496987
std        0.167165      2.005771      0.465431      0.639357      0.192556
min        0.000000      0.000000      0.000000      1.000000      0.020000
25%        0.000000      1.000000      0.000000      1.000000      0.340000
50%        0.000000      3.000000      1.000000      1.000000      0.500000
75%        0.000000      5.000000      1.000000      2.000000      0.660000
max        1.000000      6.000000      1.000000      4.000000      1.000000

              atemp           hum     windspeed        casual    registered
count  17379.000000  17379.000000  17379.000000  17379.000000  17379.000000
mean       0.475775      0.627229      0.190098     35.676218    153.786869
std        0.171850      0.192930      0.122340     49.305030    151.357286
min        0.000000      0.000000      0.000000      0.000000      0.000000
25%        0.333300      0.480000      0.104500      4.000000     34.000000
50%        0.484800      0.630000      0.194000     17.000000    115.000000
75%        0.621200      0.780000      0.253700     48.000000    220.000000
max        1.000000      1.000000      0.850700    367.000000    886.000000

                cnt
count  17379.000000
mean     189.463088
std      181.387599
min        1.000000
25%       40.000000
50%      142.000000
75%      281.000000
max      977.000000

The table above shows the descriptive statistics for each variable: the count, mean, standard deviation, minimum, maximum, and the 25th, 50th (median) and 75th percentiles.

Results from the feature length:

Feature vector length for categorical features: 4

Feature vector length for numerical features: 7

Total feature vector length: 11

Results from the Random forest regression model

Decision Tree feature vector : [ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]

Decision Tree feature vector length: 10

Decision Tree predictions: [ 38.07836056 38.07836056 38.07836056 ..., 38.07836056 38.07836056 38.07836056]

Decision Tree depth: 2

Decision Tree depth: None

Results from the Decision tree log

Decision Tree Log - Mean Squared Error 3648.31141046

Decision Tree Log - Mean Squared Error 43.7264464287

The two decision tree log entries above report a mean squared error of 3648.311 for the first model and 43.726 for the second. Because the second model has the lower error, it is the better of the two by this measure.
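The comparison above rests on the mean squared error: the average of the squared differences between actual and predicted values. The following is a minimal sketch of that calculation, not the code used in the study. The rental counts and both sets of predictions are hypothetical values chosen only for illustration.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical hourly rental counts and two sets of predictions (not from the study).
actual = [16, 40, 32, 13, 1]
model_1 = [38, 38, 38, 38, 38]  # a model that predicts one constant value
model_2 = [18, 35, 30, 15, 4]   # a model that tracks the actual counts more closely

mse_1 = mean_squared_error(actual, model_1)
mse_2 = mean_squared_error(actual, model_2)

print(f"model 1 MSE: {mse_1:.2f}")
print(f"model 2 MSE: {mse_2:.2f}")
print("better model:", "model 2" if mse_2 < mse_1 else "model 1")
```

A lower value means the predictions sit closer to the observed counts on average. Two errors are only comparable when they are computed on the same data and on the same scale of the target variable.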

The results from task 2 are discussed in this section.

Gradient boosting regression

Decision Tree feature vector : [ 0. 0. 0. 0. 0. 0. 0. 0. 0.03307816 0.96692184]

Decision Tree feature vector length: 10

Decision Tree predictions: [ 87.22443325 87.22443325 87.22443325 ..., 117.91996176 99.23814659 87.22443325]

Decision Tree depth: 2

Decision Tree depth: None

Decision tree log

The decision tree log for the gradient boosting regression is shown below.

Decision Tree Log - Mean Squared Error 6359.19130385

Decision Tree Log - Mean Squared Error 57.0455144575

As the results show, the mean squared error is again lower for the second model (57.046 versus 6359.191), so the second model is again the better of the two.
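The idea behind gradient boosting for squared error can be sketched in a few lines: start from the mean of the target, then repeatedly fit a small tree to the current residuals and add a damped copy of it to the model. The sketch below is an illustration only, not the library code used in the study. It uses one-split trees (stumps) on a single feature, and the hours, counts, number of stages and learning rate are all hypothetical.

```python
def fit_stump(x, residuals):
    """Find the single split on x that minimises the squared error of the residuals.

    Assumes x contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def boost(x, y, n_stages=20, learning_rate=0.5):
    """Each stage fits a stump to the residuals of the model built so far."""
    base = sum(y) / len(y)
    stumps = []
    predictions = [base] * len(y)
    for _ in range(n_stages):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        predictions = [pi + learning_rate * stump(xi)
                       for xi, pi in zip(x, predictions)]
    return lambda xi: base + learning_rate * sum(s(xi) for s in stumps)


def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical data: hour of day and rental count (not taken from the study).
hours = [0, 3, 6, 8, 12, 17, 20, 23]
counts = [10, 5, 60, 300, 180, 400, 150, 40]

baseline = sum(counts) / len(counts)
model = boost(hours, counts)

mse_baseline = mse(counts, [baseline] * len(counts))
mse_boosted = mse(counts, [model(h) for h in hours])
print(f"constant model MSE: {mse_baseline:.2f}")
print(f"boosted model MSE:  {mse_boosted:.2f}")
```

Both errors here are measured on the training data, so the drop only shows that each stage reduces the fit error. It says nothing about performance on unseen data.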

For the second part of the project, a data set was downloaded from Kaggle and analyzed with the same methodology. The mtcars data set was used, and the results of the analysis are shown in the following section.

Description of the data

(32, 12)

Unnamed: 0     object
mpg           float64
cyl             int64
disp          float64
hp              int64
drat          float64
wt            float64
qsec          float64
vs              int64
am              int64
gear            int64
carb            int64

As the output shows, the data set contains 12 variables (columns) and 32 observations (rows).

Descriptive statistics of the numeric variables

             mpg        cyl        disp          hp       drat         wt
count  32.000000  32.000000   32.000000   32.000000  32.000000  32.000000
mean   20.090625   6.187500  230.721875  146.687500   3.596563   3.217250
std     6.026948   1.785922  123.938694   68.562868   0.534679   0.978457
min    10.400000   4.000000   71.100000   52.000000   2.760000   1.513000
25%    15.425000   4.000000  120.825000   96.500000   3.080000   2.581250
50%    19.200000   6.000000  196.300000  123.000000   3.695000   3.325000
75%    22.800000   8.000000  326.000000  180.000000   3.920000   3.610000
max    33.900000   8.000000  472.000000  335.000000   4.930000   5.424000

            qsec         vs         am       gear     carb
count  32.000000  32.000000  32.000000  32.000000  32.0000
mean   17.848750   0.437500   0.406250   3.687500   2.8125
std     1.786943   0.504016   0.498991   0.737804   1.6152
min    14.500000   0.000000   0.000000   3.000000   1.0000
25%    16.892500   0.000000   0.000000   3.000000   2.0000
50%    17.710000   0.000000   0.000000   4.000000   2.0000
75%    18.900000   1.000000   1.000000   4.000000   4.0000
max    22.900000   1.000000   1.000000   5.000000   8.0000

The table above presents the descriptive statistics of the variables, which give an overview of the collected data set.
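To make the rows of the table concrete, the mpg column can be recomputed with Python's standard `statistics` module. This is a sketch under one assumption: that the Kaggle copy holds the 32 mpg values of the standard mtcars data set, which are typed in by hand below. The study itself does not show how the table was produced.

```python
import statistics

# The 32 mpg values of the standard mtcars data set
# (assumed to match the Kaggle copy used in the study).
mpg = [21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,
       16.4, 17.3, 15.2, 10.4, 10.4, 14.7, 32.4, 30.4, 33.9, 21.5, 15.5,
       15.2, 13.3, 19.2, 27.3, 26.0, 30.4, 15.8, 19.7, 15.0, 21.4]

# method="inclusive" interpolates linearly between the sorted data points.
q1, median, q3 = statistics.quantiles(mpg, n=4, method="inclusive")

summary = {
    "count": len(mpg),
    "mean": statistics.mean(mpg),
    "std": statistics.stdev(mpg),  # sample standard deviation (n - 1 denominator)
    "min": min(mpg),
    "25%": q1,
    "50%": median,
    "75%": q3,
    "max": max(mpg),
}

for name, value in summary.items():
    print(f"{name:>5} {value:10.6f}")
```

Under that assumption the values agree with the mpg column of the table, which indicates that the reported standard deviation is the sample version and that the percentiles use linear interpolation.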

Results from the Feature vector length:

Feature vector length for categorical features: 5

Feature vector length for numerical features: 5

Total feature vector length: 10

As the result shows, the total feature vector length is 10: 5 numerical features and 5 categorical features.
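One way to arrive at these lengths is to count the columns on each side of a categorical/numerical split. The split below is an assumption: it drops the car-name column (`Unnamed: 0`), treats `mpg` as the prediction target, and assigns the remaining columns so that the counts agree with the reported 5 and 5. The study does not state which columns it placed in each group. The example row uses the values of the first car in the standard mtcars data set.

```python
# Assumed split of the mtcars columns (not confirmed by the study).
categorical_features = ["cyl", "vs", "am", "gear", "carb"]
numerical_features = ["disp", "hp", "drat", "wt", "qsec"]

cat_len = len(categorical_features)
num_len = len(numerical_features)
total_len = cat_len + num_len

print(f"Feature vector length for categorical features: {cat_len}")
print(f"Feature vector length for numerical features: {num_len}")
print(f"Total feature vector length: {total_len}")


def build_feature_vector(row):
    """Order the values of one row as categorical features, then numerical ones."""
    return [row[name] for name in categorical_features + numerical_features]


# First row of the standard mtcars data set (Mazda RX4); mpg is the assumed target.
row = {"mpg": 21.0, "cyl": 6, "disp": 160.0, "hp": 110, "drat": 3.90,
       "wt": 2.620, "qsec": 16.46, "vs": 0, "am": 1, "gear": 4, "carb": 4}

vector = build_feature_vector(row)
print(vector)
```

If the categorical columns were one-hot encoded instead, the categorical length would be the number of distinct categories rather than the number of columns, so the reported figure of 5 suggests a plain column count.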

Results from the Decision tree:

Decision Tree feature vector : [ 0.16263247 0.31856472 0.04314577 0.00312414 0.4657291 0.0068038 0. 0. 0. ]

Decision Tree feature vector length: 10

Decision Tree predictions: [ 20.38783883 20.38783883 23.62423918 19.85791236 17.32767526
19.33232794 14.96165695 21.1793161 21.5733161 19.33232794
19.33232794 15.45256604 15.45256604 15.45256604 14.35451409
14.35451409 14.96165695 30.30945238 29.08578571 30.90028571
21.5733161 16.07032361 17.32767526 14.96165695 16.07032361
28.60745238 25.69613095 28.81042857 16.74459301 20.38783883
14.96165695 21.5733161 ]

Decision Tree depth: 2

Decision Tree depth: None

Decision Tree Log - Mean Squared Error 2.99511855857

Decision Tree Log - Mean Squared Error 1.34997530962

Here too, the mean squared error of the second model (1.350) is lower than that of the first (2.995). However, the difference between the two models is small.

Results from the Gradient boosting Regressor

Decision Tree feature vector : [ 0.26241313 0.29232947 0.18088804 0. 0.25787725 0.00649211 0. 0. 0. ]

Decision Tree feature vector length: 10

Decision Tree predictions: [ 20.36937837 20.36937837 21.608521 19.67536304 17.93841175
19.47194207 16.63320595 21.608521 21.608521 18.96648356
18.96648356 17.93841175 17.93841175 17.93841175 14.747734 14.747734
16.63320595 27.65988835 26.81396877 27.65988835 21.608521
17.93841175 17.93841175 16.63320595 17.93841175 26.34583871
23.69848123 26.34583871 17.05336798 19.86391986 16.63320595
21.608521 ]

Decision Tree depth: 2

Decision Tree depth: None

Decision Tree Log - Mean Squared Error 2.99511855857

Decision Tree Log - Mean Squared Error 1.34997530962

Here too, the mean squared error is lower for the second model than for the first.
