Get Instant Help From 5000+ Experts For
question

Writing: Get your essay and assignment written from scratch by PhD expert

Rewriting: Paraphrase or rewrite your friend's essay with similar meaning at reduced cost

Editing:Proofread your work by experts and improve grade at Lowest cost

And Improve Your Grades
myassignmenthelp.com
loader
Phone no. Missing!

Enter phone no. to receive critical updates and urgent messages !

Attach file

Error goes here

Files Missing!

Please upload all relevant files for quick & complete assistance.

Guaranteed Higher Grade!
Free Quote
wave

To demonstrate understanding of the theory and practice of scalable distributed data analysis

Types of the data set used in the study

Results from the bike sharing data is shown in the first section. The random forest model and the gradient bosting regressor model has been performed with this data set and the results are shown below. 

Types of the data set used in the study

(17379, 17)

instant         int64

dteday         object

season          int64

yr              int64

mnth            int64

hr              int64

holiday         int64

weekday         int64

workingday      int64

weathersit      int64

temp          float64

atemp         float64

hum           float64

windspeed     float64

casual          int64

registered      int64

cnt             int64

dtype: object 

Descriptive statistics of the data set :
  instant        season            yr          mnth            hr  

count  17379.0000  17379.000000  17379.000000  17379.000000  17379.000000   

mean    8690.0000      2.501640      0.502561      6.537775     11.546752   

std     5017.0295      1.106918      0.500008      3.438776      6.914405   

min        1.0000      1.000000      0.000000      1.000000      0.000000   

25%     4345.5000      2.000000      0.000000      4.000000      6.000000   

50%     8690.0000      3.000000      1.000000      7.000000     12.000000   

75%    13034.5000      3.000000      1.000000     10.000000     18.000000   

max    17379.0000      4.000000      1.000000     12.000000     23.000000    

            holiday       weekday    workingday    weathersit          temp  

count  17379.000000  17379.000000  17379.000000  17379.000000  17379.000000   

mean       0.028770      3.003683      0.682721      1.425283      0.496987   

std        0.167165      2.005771      0.465431      0.639357      0.192556   

min        0.000000      0.000000      0.000000      1.000000      0.020000   

25%        0.000000      1.000000      0.000000      1.000000      0.340000   

50%        0.000000      3.000000      1.000000      1.000000      0.500000   

75%        0.000000      5.000000      1.000000      2.000000      0.660000   

max        1.000000      6.000000      1.000000      4.000000      1.000000  

              atemp           hum     windspeed        casual    registered  

count  17379.000000  17379.000000  17379.000000  17379.000000  17379.000000   

mean       0.475775      0.627229      0.190098     35.676218    153.786869   

std        0.171850      0.192930      0.122340     49.305030    151.357286   

min        0.000000      0.000000      0.000000      0.000000      0.000000   

25%        0.333300      0.480000      0.104500      4.000000     34.000000   

50%        0.484800      0.630000      0.194000     17.000000    115.000000   

75%        0.621200      0.780000      0.253700     48.000000    220.000000   

max        1.000000      1.000000      0.850700    367.000000    886.000000    

                cnt  

count  17379.000000  

mean     189.463088  

std      181.387599  

min        1.000000  

25%       40.000000  

50%      142.000000  

75%      281.000000  

max      977.000000   

The results from the descriptive statistics is shown in the table above and the results includes the value for mean, median, standard deviation and the three major percentiles. 

Results from the feature length:

Feature vector length for categorical features:  4

Feature vector length for numerical features:  7

Total feature vector length:  11  

Results from the Random forest regression model 

Decision Tree feature vector :  [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.]

Decision Tree feature vector length:  10

Decision Tree predictions:  [ 38.07836056  38.07836056  38.07836056 ...,  38.07836056  38.07836056

  38.07836056]

Decision Tree depth:  2

Decision Tree depth:  None 

Results from the Decision tree log

Decision Tree Log - Mean Squared Error 3648.31141046

Decision Tree Log - Mean Squared Error 43.7264464287 

The results from the decision tree and the decision tree log is shown in the table above and the it shows that the decision tree log for the first model is 3648.311 whereas in the second decision tree log mean squared error is 43.72. On the basis of the errors it can be said that the second model is better as the error value is less.  

Results from the task 2 has been discussed in the current section

Descriptive statistics of the data set

Gradient booster Regression

Decision Tree feature vector :  [ 0.          0.          0.          0.          0.          0.          0.

  1.          0.          0.03307816  0.96692184]

Decision Tree feature vector length:  10

Decision Tree predictions:  [  87.22443325   87.22443325   87.22443325 ...,  117.91996176   99.23814659

   87.22443325]

Decision Tree depth:  2

Decision Tree depth:  None

Decision tree log

For the gradient booster regression the decision tree log has been shown below. 

Decision Tree Log - Mean Squared Error 6359.19130385

Decision Tree Log - Mean Squared Error 57.0455144575 

As the results shows in this case also the mean squared error is less in the second case, which means that the second model is better as compared to the first one. 

For the second part of the project the data from the Kaggle has been downloaded and similar methodology was used to analyze the data. In this case the mtcars data from kaggle has been used for the analysis purpose and the results from the analysis are shown in the below section. 

Description of the data

(32, 12)

Unnamed: 0     object

mpg           float64

cyl             int64

disp          float64

hp              int64

drat          float64

wt            float64

qsec          float64

vs              int64

am              int64

gear            int64

carb            int64 

As the results shows there are 12 different types of variables included in the data set with 32 data points. 

Descriptive statistics of the continuous variable

  mpg        cyl        disp          hp       drat         wt  

count  32.000000  32.000000   32.000000   32.000000  32.000000  32.000000   

mean   20.090625   6.187500  230.721875  146.687500   3.596563   3.217250   

std     6.026948   1.785922  123.938694   68.562868   0.534679   0.978457   

min    10.400000   4.000000   71.100000   52.000000   2.760000   1.513000   

25%    15.425000   4.000000  120.825000   96.500000   3.080000   2.581250   

50%    19.200000   6.000000  196.300000  123.000000   3.695000   3.325000   

75%    22.800000   8.000000  326.000000  180.000000   3.920000   3.610000   

max    33.900000   8.000000  472.000000  335.000000   4.930000   5.424000    

            qsec         vs         am       gear     carb  

count  32.000000  32.000000  32.000000  32.000000  32.0000  

mean   17.848750   0.437500   0.406250   3.687500   2.8125  

std     1.786943   0.504016   0.498991   0.737804   1.6152  

min    14.500000   0.000000   0.000000   3.000000   1.0000  

25%    16.892500   0.000000   0.000000   3.000000   2.0000  

50%    17.710000   0.000000   0.000000   4.000000   2.0000  

75%    18.900000   1.000000   1.000000   4.000000   4.0000  

max    22.900000   1.000000   1.000000   5.000000   8.0000   

Descriptive statistics of the variables is presented in the table above which helps the researcher to have an overview of the collected data set.

Results from the Feature vector length:

Feature vector length for categorical features:  5

Feature vector length for numerical features:  5

Total feature vector length:  10 

As the result suggest total feature vector length is 10 where the numerical is 5 and categorical is 5.

Results from the Decision tree:

Decision Tree feature vector :  [ 0.16263247  0.31856472  0.04314577  0.00312414  0.4657291   0.0068038   0.

  1.          0.          0.        ]

Decision Tree feature vector length:  10

Decision Tree predictions:  [ 20.38783883  20.38783883  23.62423918  19.85791236  17.32767526

  19.33232794  14.96165695  21.1793161   21.5733161   19.33232794

  19.33232794  15.45256604  15.45256604  15.45256604  14.35451409

  14.35451409  14.96165695  30.30945238  29.08578571  30.90028571

  21.5733161   16.07032361  17.32767526  14.96165695  16.07032361

  28.60745238  25.69613095  28.81042857  16.74459301  20.38783883

  14.96165695  21.5733161 ]

Decision Tree depth:  2

Decision Tree depth:  None 

Decision Tree Log - Mean Squared Error 2.99511855857

Decision Tree Log - Mean Squared Error 1.34997530962 

In this case also the mean squared error for the second model is less as compared to the model 1. However the difference in the model is not very high. 

Results from the Gradient boosting Regressor

Decision Tree feature vector :  [ 0.26241313  0.29232947  0.18088804  0.          0.25787725  0.00649211

  1.          0.          0.          0.        ]

Decision Tree feature vector length:  10

Decision Tree predictions:  [ 20.36937837  20.36937837  21.608521    19.67536304  17.93841175

  19.47194207  16.63320595  21.608521    21.608521    18.96648356

  18.96648356  17.93841175  17.93841175  17.93841175  14.747734    14.747734

  16.63320595  27.65988835  26.81396877  27.65988835  21.608521

  17.93841175  17.93841175  16.63320595  17.93841175  26.34583871

  23.69848123  26.34583871  17.05336798  19.86391986  16.63320595

  21.608521  ]

Decision Tree depth:  2

Decision Tree depth:  None 

Decision Tree Log - Mean Squared Error 2.99511855857

Decision Tree Log - Mean Squared Error 1.34997530962 

In this case also the mean squared error are less in the second case as compared to the first case.

Cite This Work

To export a reference to this article please select a referencing stye below:

My Assignment Help. (2021). Results Of Bike Sharing Data Analysis Using Random Forest And Gradient Boosting Regressor. Retrieved from https://myassignmenthelp.com/free-samples/ict707-data-science-practice/second-case.html.

"Results Of Bike Sharing Data Analysis Using Random Forest And Gradient Boosting Regressor." My Assignment Help, 2021, https://myassignmenthelp.com/free-samples/ict707-data-science-practice/second-case.html.

My Assignment Help (2021) Results Of Bike Sharing Data Analysis Using Random Forest And Gradient Boosting Regressor [Online]. Available from: https://myassignmenthelp.com/free-samples/ict707-data-science-practice/second-case.html
[Accessed 23 April 2024].

My Assignment Help. 'Results Of Bike Sharing Data Analysis Using Random Forest And Gradient Boosting Regressor' (My Assignment Help, 2021) <https://myassignmenthelp.com/free-samples/ict707-data-science-practice/second-case.html> accessed 23 April 2024.

My Assignment Help. Results Of Bike Sharing Data Analysis Using Random Forest And Gradient Boosting Regressor [Internet]. My Assignment Help. 2021 [cited 23 April 2024]. Available from: https://myassignmenthelp.com/free-samples/ict707-data-science-practice/second-case.html.

Get instant help from 5000+ experts for
question

Writing: Get your essay and assignment written from scratch by PhD expert

Rewriting: Paraphrase or rewrite your friend's essay with similar meaning at reduced cost

Editing: Proofread your work by experts and improve grade at Lowest cost

loader
250 words
Phone no. Missing!

Enter phone no. to receive critical updates and urgent messages !

Attach file

Error goes here

Files Missing!

Please upload all relevant files for quick & complete assistance.

Plagiarism checker
Verify originality of an essay
essay
Generate unique essays in a jiffy
Plagiarism checker
Cite sources with ease
support
Whatsapp
callback
sales
sales chat
Whatsapp
callback
sales chat
close