Regression Models for Predicting a Variable of Interest

Regression models can be used to predict just about any variable of interest. A few examples include the following:

Predicting stock returns and other economic variables
Predicting loss amounts for loan defaults (this can be combined with a classification model that predicts the probability of default, while the regression model predicts the amount in the case of a default)
Recommendations (the Alternating Least Squares factorization model from Chapter 5, Building a Recommendation Engine with Spark, uses linear regression in each iteration)
Predicting customer lifetime value (CLTV) in a retail, mobile, or other business, based on user behavior and spending patterns

In the different sections of this chapter, we will do the following:

Introduce the various types of regression models available in ML

Explore feature extraction and target variable transformation for regression models
Train a number of regression models using ML
See how to make predictions using the trained model

Investigate the impact on performance of various parameter settings for regression using cross-validation

Introduction to regression models in ML

The bike-sharing rental process is highly correlated with environmental and seasonal settings. For instance, weather conditions, precipitation, day of the week, season, and hour of the day can all affect rental behavior (Fanaee-T, 2014). The core data set is the two-year historical log for 2011 and 2012 from the Capital Bikeshare system, Washington D.C., USA, which is publicly available under System Data | Capital Bikeshare. The data were aggregated on both an hourly and a daily basis, and the corresponding weather and seasonal information, extracted from https://www.freemeteo.com, was then added.

The program's requirements are basic and can be met by any ordinary computer. The essentials are:

Hardware: a desktop or a laptop.
Software: Apache Spark, PySpark, and the Matplotlib, Pylab, and NumPy Python libraries.
Platform: any system that runs Python, such as Ubuntu, Windows 10, or macOS.

Regression models define the relationship between a dependent variable and one or more independent variables (Kutner, 2004). They are concerned with target variables that can take any real value. The underlying principle is to find a model that maps input features to predicted target values. A few examples of where regression models are used include:

Predicting stock returns and other economic variables
Predicting customer lifetime value in a retail, mobile, or other business based on user behavior and spending patterns
Predicting loss amounts for loan defaults, among many others
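To make the idea of mapping input features to a predicted target concrete, the following is a minimal sketch of a one-feature linear regression fitted in closed form. The data and the function names (fit_line, predict) are hypothetical and chosen only for illustration; they are not part of the bike-sharing program.

```python
# Hypothetical toy data, not taken from the bike-sharing data set.
def fit_line(xs, ys):
    """Closed-form ordinary least squares for y = slope * x + intercept."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(x, slope, intercept):
    """Apply the fitted line to a new input."""
    return slope * x + intercept

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1
slope, intercept = fit_line(xs, ys)
print("slope=%.2f intercept=%.2f prediction for x=5: %.2f"
      % (slope, intercept, predict(5.0, slope, intercept)))
```

With several input features the same principle applies, but the coefficients are usually found iteratively, as Spark's linear models do.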

In our program, we predict the bike rental count from environmental and seasonal settings using PySpark and Python (Srinivasa, 2015), learning from historical rental data to forecast rental demand.

# This program predicts future bike rental demand

import matplotlib

from pylab import hist

from pyspark.mllib.regression import LabeledPoint

import numpy as np

from pyspark.mllib.regression import LinearRegressionWithSGD, RidgeRegressionWithSGD

from pyspark.mllib.tree import DecisionTree, GradientBoostedTrees

from pyspark import SparkContext

spark = SparkContext('local', 'Assignment Project', '/usr/spark-hadoop')

raw_data = spark.textFile('/home/livinggoods/Projects/Others/regression_modelling/data/hour-noheader.csv')

print raw_data.count()

data_count = raw_data.count()

records = raw_data.map(lambda x: x.split(','))

first = records.first()

print first

print data_count

records.cache()

def get_mapping(rdd, idx):
    # Map each distinct value in column idx to a unique integer index
    return rdd.map(lambda fields: fields[idx]).distinct().zipWithIndex().collectAsMap()

print "Mapping of first categorical feature column: %s" % get_mapping(records, 2)

mappings = [get_mapping(records, i) for i in range(2,10)]

cat_len = sum(map(len, mappings))

num_len = len(records.first()[10:14])

total_len = num_len + cat_len

print "Feature vector length for categorical features: %d" % cat_len

print "Feature vector length for numerical features: %d" % num_len

print "Total feature vector length: %d" % total_len

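def extract_features(record):
    # One-hot encode the categorical columns, then append the numerical columns
    cat_vec = np.zeros(cat_len)
    i = 0
    step = 0
    # Columns 2-9 are categorical, matching the range used to build mappings
    for field in record[2:10]:
        m = mappings[i]
        idx = m[field]
        cat_vec[idx + step] = 1
        i = i + 1
        step = step + len(m)
    num_vec = np.array([float(field) for field in record[10:14]])
    return np.concatenate((cat_vec, num_vec))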

def extract_label(record):
    # The target (rental count) is the last column of each record
    return float(record[-1])

data = records.map(lambda r: LabeledPoint(extract_label(r), extract_features(r)))

first_point = data.first()

print "Raw data: " + str(first[2:])

print "Label: " + str(first_point.label)

print "Linear Model feature vector:\n" + str(first_point.features)
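The feature vector printed above is built by one-hot encoding each categorical column and concatenating the results. The following is a minimal pure-Python sketch of that encoding, without Spark. The rows, column choices, and helper names (build_mapping, one_hot) are hypothetical. Unlike distinct().zipWithIndex(), this sketch sorts the values so that the index assignment is deterministic.

```python
def build_mapping(rows, idx):
    """Assign each distinct value in column idx a position (sorted for determinism)."""
    values = sorted(set(row[idx] for row in rows))
    return dict((value, pos) for pos, value in enumerate(values))

def one_hot(row, mappings, columns):
    """Concatenate one binary sub-vector per categorical column."""
    vec = [0.0] * sum(len(m) for m in mappings)
    offset = 0
    for m, col in zip(mappings, columns):
        vec[offset + m[row[col]]] = 1.0  # set the slot for this column's value
        offset += len(m)                 # move to the next column's sub-vector
    return vec

# Hypothetical rows of (season, weather); not the real data set
rows = [("1", "clear"), ("2", "rain"), ("1", "rain"), ("3", "clear")]
columns = [0, 1]
mappings = [build_mapping(rows, c) for c in columns]
encoded = one_hot(rows[1], mappings, columns)
print(encoded)
```

Here three distinct seasons and two distinct weather values give a vector of length five, with exactly one slot set per column.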

Impact of parameter settings on performance of regression

def evaluate(train, test, iterations, step, regParam, regType, intercept):
    # Train a linear model with the given settings and return its RMSLE on the test set
    model = LinearRegressionWithSGD.train(train, iterations, step, regParam=regParam, regType=regType, intercept=intercept)
    tp = test.map(lambda p: (p.label, model.predict(p.features)))
    rmsle = np.sqrt(tp.map(lambda (t, p): squared_log_error(t, p)).mean())
    return rmsle

def extract_features_dt(record):
    # Decision trees can use the raw categorical and numerical values directly
    return np.array(map(float, record[2:14]))

def squared_error(actual, pred):
    return (pred - actual)**2

def abs_error(actual, pred):
    return np.abs(pred - actual)

def squared_log_error(actual, pred):
    return (np.log(pred + 1) - np.log(actual + 1))**2
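The three error functions above are averaged over all (actual, predicted) pairs to give the mean squared error (MSE), the mean absolute error (MAE), and the root mean squared log error (RMSLE). A small worked sketch on hypothetical pairs, using only the standard library, shows the arithmetic:

```python
import math

def mse(pairs):
    """Mean of squared differences."""
    return sum((p - t) ** 2 for t, p in pairs) / float(len(pairs))

def mae(pairs):
    """Mean of absolute differences."""
    return sum(abs(p - t) for t, p in pairs) / float(len(pairs))

def rmsle(pairs):
    """Square root of the mean squared difference of log(value + 1)."""
    total = sum((math.log(p + 1) - math.log(t + 1)) ** 2 for t, p in pairs)
    return math.sqrt(total / float(len(pairs)))

# Hypothetical (actual, predicted) pairs
pairs = [(3.0, 1.0), (0.0, 0.0), (7.0, 7.0)]
print("MSE=%.4f MAE=%.4f RMSLE=%.4f" % (mse(pairs), mae(pairs), rmsle(pairs)))
```

Only the first pair contributes an error here, so MSE is 4/3 and MAE is 2/3. RMSLE penalizes relative rather than absolute differences, which suits targets such as rental counts that span a wide range.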

print "Linear Model feature vector length: " + str(len(first_point.features))

data_dt = records.map(lambda r: LabeledPoint(extract_label(r), extract_features_dt(r)))

first_point_dt = data_dt.first()

print "Decision Tree feature vector: " + str(first_point_dt.features)

print "Decision Tree feature vector length: " + str(len(first_point_dt.features))

linear_model = LinearRegressionWithSGD.train(data, iterations=10, step=0.1, intercept=False)

true_vs_predicted = data.map(lambda p: (p.label, linear_model.predict(p.features)))

print "Linear Model predictions: " + str(true_vs_predicted.take(5))

dt_model = DecisionTree.trainRegressor(data_dt,{})

preds = dt_model.predict(data_dt.map(lambda p: p.features))

actual = data.map(lambda p: p.label)

true_vs_predicted_dt = actual.zip(preds)

print "Decision Tree predictions: " + str(true_vs_predicted_dt.take(5))

print "Decision Tree depth: " + str(dt_model.depth())

print "Decision Tree number of nodes: " + str(dt_model.numNodes())

mse = true_vs_predicted.map(lambda (t, p): squared_error(t,p)).mean()

mae = true_vs_predicted.map(lambda (t, p): abs_error(t, p)).mean()

rmsle = np.sqrt(true_vs_predicted.map(lambda (t, p): squared_log_error(t, p)).mean())

print "Linear Model - Mean Squared Error: %2.4f" % mse

print "Linear Model - Mean Absolute Error: %2.4f" % mae

print "Linear Model - Root Mean Squared Log Error: %2.4f" % rmsle

mse_dt = true_vs_predicted_dt.map(lambda (t, p): squared_error(t,p)).mean()

mae_dt = true_vs_predicted_dt.map(lambda (t, p): abs_error(t, p)).mean()

rmsle_dt = np.sqrt(true_vs_predicted_dt.map(lambda (t, p): squared_log_error(t, p)).mean())

print "Decision Tree - Mean Squared Error: %2.4f" % mse_dt

print "Decision Tree - Mean Absolute Error: %2.4f" % mae_dt

print "Decision Tree - Root Mean Squared Log Error: %2.4f" % rmsle_dt

targets = records.map(lambda r: float(r[-1])).collect()

hist(targets, bins=40, color='lightblue', normed=True)

fig = matplotlib.pyplot.gcf()

fig.set_size_inches(16, 10)

log_targets = records.map(lambda r: np.log(float(r[-1]))).collect()

hist(log_targets, bins=40, color='lightblue', normed=True)

fig = matplotlib.pyplot.gcf()

fig.set_size_inches(16, 10)

sqrt_targets = records.map(lambda r: np.sqrt(float(r[-1]))).collect()

hist(sqrt_targets, bins=40, color='lightblue', normed=True)

fig = matplotlib.pyplot.gcf()

fig.set_size_inches(16, 10)

data_log = data.map(lambda lp: LabeledPoint(np.log(lp.label), lp.features))

model_log = LinearRegressionWithSGD.train(data_log, iterations=10, step=0.1)

true_vs_predicted_log = data_log.map(lambda p: (np.exp(p.label), np.exp(model_log.predict(p.features))))
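The log transformation above trains the model on log(label) and then applies exp to the predictions to return to the original scale. The following minimal sketch, on hypothetical skewed targets, illustrates the round trip with the simplest possible model: a constant, for which the squared-error-minimizing prediction is the mean. Fitting in log space and transforming back yields the geometric mean, which is far less dominated by the largest value than the arithmetic mean.

```python
import math

targets = [1.0, 10.0, 100.0]  # hypothetical, heavily skewed target values

# Constant model on the raw scale: the mean minimizes squared error
pred_raw = sum(targets) / len(targets)

# Same model fitted on log-transformed targets, then mapped back with exp
log_targets = [math.log(t) for t in targets]
pred_log = math.exp(sum(log_targets) / len(log_targets))

# The transform is invertible, so the original labels are recoverable
recovered = [math.exp(v) for v in log_targets]

print("raw-scale prediction: %.2f" % pred_raw)
print("log-scale prediction, back-transformed: %.2f" % pred_log)
```

This sketch assumes strictly positive targets; a target of zero would need an offset such as log(t + 1) before transforming.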

mse_log = true_vs_predicted_log.map(lambda (t, p): squared_error(t, p)).mean()

mae_log = true_vs_predicted_log.map(lambda (t, p): abs_error(t, p)).mean()

rmsle_log = np.sqrt(true_vs_predicted_log.map(lambda (t, p): squared_log_error(t, p)).mean())

print "Mean Squared Error: %2.4f" % mse_log

print "Mean Absolute Error: %2.4f" % mae_log

print "Root Mean Squared Log Error: %2.4f" % rmsle_log

print "Non log-transformed predictions:\n" + str(true_vs_predicted.take(3))

print "Log-transformed predictions:\n" + str(true_vs_predicted_log.take(3))

data_dt_log = data_dt.map(lambda lp: LabeledPoint(np.log(lp.label), lp.features))

dt_model_log = DecisionTree.trainRegressor(data_dt_log,{})

preds_log = dt_model_log.predict(data_dt_log.map(lambda p: p.features))

actual_log = data_dt_log.map(lambda p: p.label)

true_vs_predicted_dt_log = actual_log.zip(preds_log).map(lambda (t, p): (np.exp(t), np.exp(p)))

mse_log_dt = true_vs_predicted_dt_log.map(lambda (t, p): squared_error(t, p)).mean()

mae_log_dt = true_vs_predicted_dt_log.map(lambda (t, p): abs_error(t, p)).mean()

rmsle_log_dt = np.sqrt(true_vs_predicted_dt_log.map(lambda (t, p): squared_log_error(t, p)).mean())

print "Mean Squared Error: %2.4f" % mse_log_dt

print "Mean Absolute Error: %2.4f" % mae_log_dt

print "Root Mean Squared Log Error: %2.4f" % rmsle_log_dt

print "Non log-transformed predictions:\n" + str(true_vs_predicted_dt.take(3))

print "Log-transformed predictions:\n" + str(true_vs_predicted_dt_log.take(3))

data_with_idx = data.zipWithIndex().map(lambda (k, v): (v, k))

test = data_with_idx.sample(False, 0.2, 42)

train = data_with_idx.subtractByKey(test)

train_data = train.map(lambda (idx, p): p)

test_data = test.map(lambda (idx, p) : p)

train_size = train_data.count()

test_size = test_data.count()

print "Training data size: %d" % train_size

print "Test data size: %d" % test_size

print "Total data size: %d " % data_count

print "Train + Test size : %d" % (train_size + test_size)
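The split above keys every record by its index, samples roughly 20% of the keys as the test set, and removes those keys from the full set to obtain the training set. The following is a minimal pure-Python sketch of the same idea; the helper name split_by_index and the stand-in records are hypothetical. As with sampling without replacement in Spark, the test set size is only approximately the requested fraction.

```python
import random

def split_by_index(items, test_fraction, seed):
    """Sample a test set by index, then keep the remaining indices for training."""
    rng = random.Random(seed)
    indexed = list(enumerate(items))  # analogous to zipWithIndex, keyed by position
    # Each record is chosen for the test set independently with probability test_fraction
    test = [(i, x) for i, x in indexed if test_fraction > rng.random()]
    test_keys = set(i for i, _ in test)
    # Analogous to subtractByKey: everything not in the test set
    train = [(i, x) for i, x in indexed if i not in test_keys]
    return train, test

items = ["record-%d" % n for n in range(1000)]  # hypothetical stand-in records
train, test = split_by_index(items, 0.2, 42)
print("train=%d test=%d total=%d" % (len(train), len(test), len(train) + len(test)))
```

Because the two sets are disjoint and together cover every index, the train and test sizes always sum to the total, which is what the final print statement in the program checks.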

data_with_idx_dt = data_dt.zipWithIndex().map(lambda (k, v): (v, k))

test_dt = data_with_idx_dt.sample(False, 0.2, 42)

train_dt = data_with_idx_dt.subtractByKey(test_dt)

train_data_dt = train_dt.map(lambda (idx, p): p)

test_data_dt = test_dt.map(lambda (idx, p) : p)

params = [1, 5, 10, 20, 50, 100]

metrics = [evaluate(train_data, test_data, param, 0.01, 0.0, 'l2', False) for param in params]

print params

print metrics

matplotlib.pyplot.plot(params, metrics)

fig = matplotlib.pyplot.gcf()

matplotlib.pyplot.xscale('log')
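The sweep above retrains the model for each iteration count and records the test error. To show why more iterations typically lower the error, here is a minimal sketch using plain batch gradient descent on a hypothetical one-feature problem. This is a stand-in for illustration, not Spark's stochastic gradient descent, and the names train_gd and rmse are assumptions.

```python
import math

def train_gd(xs, ys, iterations, step):
    """Batch gradient descent for y = w * x (no intercept), starting from w = 0."""
    w = 0.0
    n = float(len(xs))
    for _ in range(iterations):
        # Gradient of half the mean squared error with respect to w
        grad = sum((w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= step * grad
    return w

def rmse(xs, ys, w):
    """Root mean squared error of the fitted line on the given data."""
    return math.sqrt(sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / float(len(xs)))

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # hypothetical data generated from y = 2x
params = [1, 5, 10, 20, 50, 100]
metrics = [rmse(xs, ys, train_gd(xs, ys, n_iter, 0.01)) for n_iter in params]

for n_iter, err in zip(params, metrics):
    print("iterations=%3d rmse=%.4f" % (n_iter, err))
```

On this toy problem the error shrinks steadily as the iteration count grows. On real data the improvement usually levels off, which is why plotting the metric against the parameter on a log scale is informative.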

The output is as below:

Conclusion

We can therefore conclude that the program does what it set out to do: it trains regression models that predict bike rental counts from environmental and seasonal data. It could be used by a bike rental organization to forecast future rental demand, and as a basis for event and anomaly detection.

References

Fanaee-T, H. and Gama, J., 2014. Event labeling combining ensemble detectors and background knowledge. Progress in Artificial Intelligence, 2(2-3), pp. 113-127.

Kutner, M., Nachtsheim, C. and Neter, J., 2004. Applied Linear Regression Models. McGraw-Hill/Irwin.

Srinivasa, K. and Muppalla, A., 2015. Getting Started with Spark. In: Guide to High Performance Distributed Computing. Springer, pp. 73-99.

Other sources from the internet include: