Regression models can be used to predict just about any variable of interest. A few examples include the following:
- Predicting stock returns and other economic variables
- Predicting loss amounts for loan defaults (this can be combined with a classification model that predicts the probability of default, while the regression model predicts the amount in the case of a default)
- Recommendations (the Alternating Least Squares factorization model from Chapter 5, Building a Recommendation Engine with Spark, uses linear regression in each iteration)
- Predicting customer lifetime value (CLTV) in a retail, mobile, or other business, based on user behavior and spending patterns
In the different sections of this chapter, we will do the following:
- Introduce the various types of regression models available in ML
- Explore feature extraction and target variable transformation for regression models
- Train a number of regression models using ML
- See how to make predictions using the trained model
- Investigate the impact on performance of various parameter settings for regression using cross-validation
Introduction to regression models in ML
The bike-sharing rental process is highly correlated with environmental and seasonal settings. For instance, weather conditions, precipitation, day of the week, season, and hour of the day can all affect rental behavior (Fanaee-T, 2014). The core data set is the two-year historical log for 2011 and 2012 from the Capital Bikeshare system, Washington D.C., USA, which is publicly available on the Capital Bikeshare System Data page. The data were aggregated on both an hourly and a daily basis, and the corresponding weather and seasonal information, extracted from https://www.freemeteo.com, was then added.
The program's requirements are modest and can be met by any ordinary computer. The essential requirements are:
- Hardware: a desktop or laptop computer.
- Software: Apache Spark with PySpark, plus the Matplotlib (including pylab) and NumPy Python libraries.
- Platform: Python runs on any major platform, such as Ubuntu, Windows 10, or macOS.
Regression models define the relationship between a dependent variable and one or more independent variables (Kutner, 2004). They are concerned with target variables that can take any real value. The underlying principle is to find a model that maps input features to predicted target values. A few examples of where regression models are used include:
- Predicting stock returns and other economic variables
- Predicting customer lifetime value in a retail, mobile, or other business based on user behavior and spending patterns
- Predicting loss amounts for loan defaults, and many others
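As a minimal sketch of this idea, the snippet below fits a linear model to a handful of made-up points with ordinary least squares. The data, the variable names, and the use of NumPy's `lstsq` are assumptions chosen for illustration; the program in this essay uses Spark instead.

```python
import numpy as np

# Hypothetical toy data: one input feature x and a target generated as y = 2 + 3x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x

# Add a column of ones so the model can learn an intercept.
X = np.column_stack([np.ones_like(x), x])

# Ordinary least squares: find the weights w that minimize ||Xw - y||^2.
w, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = w

# Predict the target for a new, unseen input.
prediction = intercept + slope * 5.0
print(intercept, slope, prediction)
```

Because the toy targets lie exactly on a line, the fit recovers the intercept and slope that generated them; real data such as rental counts will leave a residual error, which is what the evaluation metrics later in the program measure.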
In our program, we predict the bike rental count from environmental and seasonal settings with PySpark and Python (Srinivasa, 2015), drawing on historical rental data to forecast rental demand.
# This program helps predict future bike rental demand.
# It uses the RDD-based pyspark.mllib API; LinearRegressionWithSGD has been deprecated since Spark 2.0.
import matplotlib.pyplot  # binds the name `matplotlib` and loads the pyplot submodule
import numpy as np
from pylab import hist
from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD, RidgeRegressionWithSGD
from pyspark.mllib.tree import DecisionTree, GradientBoostedTrees

# Local Spark context: master, application name, Spark home.
spark = SparkContext('local', 'Assignment Project', '/usr/spark-hadoop')
# Hourly data set with the header row removed.
raw_data = spark.textFile('/home/livinggoods/Projects/Others/regression_modelling/data/hour-noheader.csv')
data_count = raw_data.count()
# Split each CSV line into fields and cache the result, since it is reused many times.
records = raw_data.map(lambda x: x.split(','))
records.cache()
first = records.first()
print(first)
print(data_count)
def get_mapping(rdd, idx):
    # Map each distinct value in column idx to an index, for 1-of-k (one-hot) encoding.
    return rdd.map(lambda fields: fields[idx]).distinct().zipWithIndex().collectAsMap()

print("Mapping of first categorical feature column: %s" % get_mapping(records, 2))
# Columns 2 to 9 are categorical; columns 10 to 13 hold the numerical features.
mappings = [get_mapping(records, i) for i in range(2, 10)]
cat_len = sum(map(len, mappings))
num_len = len(records.first()[10:14])
total_len = num_len + cat_len
print("Feature vector length for categorical features: %d" % cat_len)
print("Feature vector length for numerical features: %d" % num_len)
print("Total feature vector length: %d" % total_len)
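def extract_features(record):
    # One-hot encode the categorical columns, then append the numerical columns.
    cat_vec = np.zeros(cat_len)
    i = 0
    step = 0
    # Slice 2:10 covers all eight categorical columns, matching the eight mappings.
    for field in record[2:10]:
        m = mappings[i]
        idx = m[field]
        cat_vec[idx + step] = 1
        i = i + 1
        step = step + len(m)
    num_vec = np.array([float(field) for field in record[10:14]])
    return np.concatenate((cat_vec, num_vec))

def extract_label(record):
    # The target (total rental count) is the last column.
    return float(record[-1])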
# Build the labeled points for the linear model and inspect the first one.
data = records.map(lambda r: LabeledPoint(extract_label(r), extract_features(r)))
first_point = data.first()
print("Raw data: " + str(first[2:]))
print("Label: " + str(first_point.label))
print("Linear Model feature vector:\n" + str(first_point.features))
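The 1-of-k (one-hot) encoding performed by `extract_features` can be illustrated without Spark. The sketch below uses two made-up categorical columns and hypothetical helper names (`build_mapping`, `encode`); it mirrors the offset logic of the program but is not part of it.

```python
import numpy as np

# Hypothetical records: two categorical fields followed by one numeric field.
rows = [["spring", "clear", "0.24"],
        ["summer", "rain", "0.80"],
        ["spring", "rain", "0.50"]]

def build_mapping(rows, idx):
    # Assign each distinct value in column idx a position (sorted for a stable order).
    values = sorted(set(row[idx] for row in rows))
    return {value: i for i, value in enumerate(values)}

mappings = [build_mapping(rows, i) for i in range(2)]
cat_len = sum(len(m) for m in mappings)

def encode(row):
    cat_vec = np.zeros(cat_len)
    offset = 0
    for field, mapping in zip(row[:2], mappings):
        # Set the slot for this value inside the block reserved for its column.
        cat_vec[offset + mapping[field]] = 1.0
        offset += len(mapping)
    num_vec = np.array([float(row[2])])
    return np.concatenate((cat_vec, num_vec))

print(encode(rows[1]))
```

Each categorical column owns a contiguous block of the vector, and exactly one slot per block is set to 1, so the vector length equals the total number of distinct categorical values plus the number of numeric fields.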
Impact of parameter settings on performance of regression
def evaluate(train, test, iterations, step, regParam, regType, intercept):
    # Train a linear model on `train` and return the RMSLE of its predictions on `test`.
    model = LinearRegressionWithSGD.train(train, iterations, step, regParam=regParam, regType=regType, intercept=intercept)
    tp = test.map(lambda p: (p.label, model.predict(p.features)))
    rmsle = np.sqrt(tp.map(lambda pair: squared_log_error(pair[0], pair[1])).mean())
    return rmsle
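def extract_features_dt(record):
    # The decision tree uses the raw column values, without one-hot encoding.
    return np.array([float(field) for field in record[2:14]])

def squared_error(actual, pred):
    return (pred - actual)**2

def abs_error(actual, pred):
    return np.abs(pred - actual)

def squared_log_error(actual, pred):
    # Squared difference of log(value + 1); symmetric in its two arguments.
    return (np.log(pred + 1) - np.log(actual + 1))**2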
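To make the three error measures concrete, here is a small worked example on made-up values. The numbers are illustrative assumptions, not results from the bike data.

```python
import numpy as np

# Hypothetical true targets and predictions.
actual = np.array([3.0, 0.0, 7.0])
pred = np.array([1.0, 0.0, 7.0])

# Mean squared error: average of the squared differences.
mse = np.mean((pred - actual) ** 2)
# Mean absolute error: average of the absolute differences.
mae = np.mean(np.abs(pred - actual))
# Root mean squared log error: RMSE computed on log(value + 1).
rmsle = np.sqrt(np.mean((np.log(pred + 1) - np.log(actual + 1)) ** 2))

print(mse, mae, rmsle)
```

Only the first pair differs, so MSE is 4/3 and MAE is 2/3. RMSLE works on the ratio of (value + 1) terms rather than on the raw difference, which keeps a few very large targets from dominating the score.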
print("Linear Model feature vector length: " + str(len(first_point.features)))
# Build the labeled points for the decision tree from the raw feature values.
data_dt = records.map(lambda r: LabeledPoint(extract_label(r), extract_features_dt(r)))
first_point_dt = data_dt.first()
print("Decision Tree feature vector: " + str(first_point_dt.features))
print("Decision Tree feature vector length: " + str(len(first_point_dt.features)))
# Train a linear model on the full data set and pair true values with predictions.
linear_model = LinearRegressionWithSGD.train(data, iterations=10, step=0.1, intercept=False)
true_vs_predicted = data.map(lambda p: (p.label, linear_model.predict(p.features)))
print("Linear Model predictions: " + str(true_vs_predicted.take(5)))
# Train a decision tree; the empty dict declares no feature as categorical.
dt_model = DecisionTree.trainRegressor(data_dt, {})
preds = dt_model.predict(data_dt.map(lambda p: p.features))
actual = data_dt.map(lambda p: p.label)
true_vs_predicted_dt = actual.zip(preds)
print("Decision Tree predictions: " + str(true_vs_predicted_dt.take(5)))
print("Decision Tree depth: " + str(dt_model.depth()))
print("Decision Tree number of nodes: " + str(dt_model.numNodes()))
# Evaluate both models with MSE, MAE and RMSLE.
mse = true_vs_predicted.map(lambda tp: squared_error(tp[0], tp[1])).mean()
mae = true_vs_predicted.map(lambda tp: abs_error(tp[0], tp[1])).mean()
rmsle = np.sqrt(true_vs_predicted.map(lambda tp: squared_log_error(tp[0], tp[1])).mean())
print("Linear Model - Mean Squared Error: %2.4f" % mse)
print("Linear Model - Mean Absolute Error: %2.4f" % mae)
print("Linear Model - Root Mean Squared Log Error: %2.4f" % rmsle)
mse_dt = true_vs_predicted_dt.map(lambda tp: squared_error(tp[0], tp[1])).mean()
mae_dt = true_vs_predicted_dt.map(lambda tp: abs_error(tp[0], tp[1])).mean()
rmsle_dt = np.sqrt(true_vs_predicted_dt.map(lambda tp: squared_log_error(tp[0], tp[1])).mean())
print("Decision Tree - Mean Squared Error: %2.4f" % mse_dt)
print("Decision Tree - Mean Absolute Error: %2.4f" % mae_dt)
print("Decision Tree - Root Mean Squared Log Error: %2.4f" % rmsle_dt)
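# Distribution of the raw target values.
targets = records.map(lambda r: float(r[-1])).collect()
# `density` replaces the `normed` argument, which was removed in Matplotlib 3.1.
hist(targets, bins=40, color='lightblue', density=True)
fig = matplotlib.pyplot.gcf()
fig.set_size_inches(16, 10)
# Distribution of the log-transformed targets, drawn on a new figure.
matplotlib.pyplot.figure()
log_targets = records.map(lambda r: np.log(float(r[-1]))).collect()
hist(log_targets, bins=40, color='lightblue', density=True)
fig = matplotlib.pyplot.gcf()
fig.set_size_inches(16, 10)
# Distribution of the square-root-transformed targets, drawn on a new figure.
matplotlib.pyplot.figure()
sqrt_targets = records.map(lambda r: np.sqrt(float(r[-1]))).collect()
hist(sqrt_targets, bins=40, color='lightblue', density=True)
fig = matplotlib.pyplot.gcf()
fig.set_size_inches(16, 10)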
# Train the linear model on log-transformed targets, then map back with exp before scoring.
data_log = data.map(lambda lp: LabeledPoint(np.log(lp.label), lp.features))
model_log = LinearRegressionWithSGD.train(data_log, iterations=10, step=0.1)
true_vs_predicted_log = data_log.map(lambda p: (np.exp(p.label), np.exp(model_log.predict(p.features))))
mse_log = true_vs_predicted_log.map(lambda tp: squared_error(tp[0], tp[1])).mean()
mae_log = true_vs_predicted_log.map(lambda tp: abs_error(tp[0], tp[1])).mean()
rmsle_log = np.sqrt(true_vs_predicted_log.map(lambda tp: squared_log_error(tp[0], tp[1])).mean())
print("Mean Squared Error: %2.4f" % mse_log)
print("Mean Absolute Error: %2.4f" % mae_log)
print("Root Mean Squared Log Error: %2.4f" % rmsle_log)
print("Non log-transformed predictions:\n" + str(true_vs_predicted.take(3)))
print("Log-transformed predictions:\n" + str(true_vs_predicted_log.take(3)))
# Repeat the log-target experiment for the decision tree.
data_dt_log = data_dt.map(lambda lp: LabeledPoint(np.log(lp.label), lp.features))
dt_model_log = DecisionTree.trainRegressor(data_dt_log, {})
preds_log = dt_model_log.predict(data_dt_log.map(lambda p: p.features))
actual_log = data_dt_log.map(lambda p: p.label)
true_vs_predicted_dt_log = actual_log.zip(preds_log).map(lambda tp: (np.exp(tp[0]), np.exp(tp[1])))
mse_log_dt = true_vs_predicted_dt_log.map(lambda tp: squared_error(tp[0], tp[1])).mean()
mae_log_dt = true_vs_predicted_dt_log.map(lambda tp: abs_error(tp[0], tp[1])).mean()
rmsle_log_dt = np.sqrt(true_vs_predicted_dt_log.map(lambda tp: squared_log_error(tp[0], tp[1])).mean())
print("Mean Squared Error: %2.4f" % mse_log_dt)
print("Mean Absolute Error: %2.4f" % mae_log_dt)
print("Root Mean Squared Log Error: %2.4f" % rmsle_log_dt)
print("Non log-transformed predictions:\n" + str(true_vs_predicted_dt.take(3)))
print("Log-transformed predictions:\n" + str(true_vs_predicted_dt_log.take(3)))
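The target transformation used above can be sketched on its own: train on the logarithm of the target, then exponentiate predictions to return to the original scale. The values below are made up, and the use of `np.log1p` and `np.expm1` is a suggested variant rather than what the program does: plain `np.log` is undefined at zero, whereas `log1p` handles zero counts.

```python
import numpy as np

# Hypothetical rental counts, including a zero to show why log1p can be safer.
counts = np.array([0.0, 1.0, 16.0, 977.0])

# Forward transform: log(1 + count) compresses the long right tail.
log_counts = np.log1p(counts)

# Inverse transform: exp(value) - 1 maps back to the original scale.
restored = np.expm1(log_counts)

print(log_counts)
print(restored)
```

The round trip returns the original counts, which is why errors can still be reported on the original scale after training in log space.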
# Key every point by its index, sample about 20% of the keys for testing, and train on the rest.
data_with_idx = data.zipWithIndex().map(lambda kv: (kv[1], kv[0]))
test = data_with_idx.sample(False, 0.2, 42)
train = data_with_idx.subtractByKey(test)
train_data = train.map(lambda kv: kv[1])
test_data = test.map(lambda kv: kv[1])
train_size = train_data.count()
test_size = test_data.count()
print("Training data size: %d" % train_size)
print("Test data size: %d" % test_size)
print("Total data size: %d" % data_count)
print("Train + Test size: %d" % (train_size + test_size))
# Apply the same split to the decision tree data.
data_with_idx_dt = data_dt.zipWithIndex().map(lambda kv: (kv[1], kv[0]))
test_dt = data_with_idx_dt.sample(False, 0.2, 42)
train_dt = data_with_idx_dt.subtractByKey(test_dt)
train_data_dt = train_dt.map(lambda kv: kv[1])
test_data_dt = test_dt.map(lambda kv: kv[1])
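The split above keys each point by its index, samples roughly 20% of the keys for testing, and keeps the remainder for training. A minimal pure-Python sketch of the same idea follows; the data, the seed, and the helper name `split_by_index` are assumptions for illustration.

```python
import random

def split_by_index(points, test_fraction, seed):
    # Pair each point with its position, mirroring zipWithIndex.
    indexed = list(enumerate(points))
    rng = random.Random(seed)
    # Keep each index for testing with probability test_fraction (no replacement).
    test_keys = set(i for i, _ in indexed if rng.random() < test_fraction)
    # Everything not sampled for testing goes to training, mirroring subtractByKey.
    train = [p for i, p in indexed if i not in test_keys]
    test = [p for i, p in indexed if i in test_keys]
    return train, test

points = list(range(1000))  # hypothetical stand-in for the labeled points
train, test = split_by_index(points, 0.2, 42)
print(len(train), len(test), len(train) + len(test))
```

Because every index lands in exactly one of the two sets, the sizes always add up to the total, which is the same sanity check the program prints. The test share is only approximately 20%, since each point is sampled independently.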
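# Measure the test-set RMSLE for an increasing number of SGD iterations.
params = [1, 5, 10, 20, 50, 100]
metrics = [evaluate(train_data, test_data, param, 0.01, 0.0, 'l2', False) for param in params]
print(params)
print(metrics)
# Plot RMSLE against the number of iterations on a logarithmic x-axis.
matplotlib.pyplot.plot(params, metrics)
fig = matplotlib.pyplot.gcf()
matplotlib.pyplot.xscale('log')
matplotlib.pyplot.show()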
The output is as below:
Conclusion
We can therefore conclude that the program achieves its aim. It can be used by a bike rental organization to predict future rental demand, and it can also support event and anomaly detection.
References
Fanaee-T, H. and Gama, J., 2014. Event labeling combining ensemble detectors and background knowledge. Progress in Artificial Intelligence, 2(2-3), pp. 113-127.
Kutner, M., Nachtsheim, C. and Neter, J., 2004. Applied linear regression models. McGraw-Hill/Irwin.
Srinivasa, K. and Muppalla, A., 2015. Getting Started with Spark. In: Guide to High Performance Distributed Computing. Springer, pp. 73-99.
Other sources from the internet include:
- System Data | Capital Bikeshare
- https://www.freemeteo.com.