
# CSC 411 Machine Learning And Data Mining

• Course Code: CSC 411
• University: University Of Toronto Mississauga

## Questions:

In this question you will generate and plot 2-dimensional data for a binary classification problem. We will call the two classes Class 0 and Class 1 (for which the target values are t = 0 and t = 1, respectively).

(a) (6 points) Write a Python function genData(mu0,mu1,Sigma0,Sigma1,N) that generates two clusters of data, one for each class. Each cluster consists of N data points. The cluster for class 0 is centred at mu0 and has covariance matrix Sigma0. The cluster for class 1 is centred at mu1 and has covariance matrix Sigma1. Note that mu0 and mu1 and all the data points are 2-dimensional vectors. Sigma0 and Sigma1 are 2 × 2 symmetric matrices that describe the shape of the clusters: the diagonal entries specify the variance of a cluster along each of the two dimensions, and the off-diagonal entries describe how correlated the two dimensions are. The function should return two arrays, X and t, representing data points and target values, respectively. X is a 2N × 2 dimensional array in which each row is a data point. t is a 2N-dimensional vector of 0s and 1s. Specifically, t[i] is 0 if X[i] belongs to class 0, and 1 if it belongs to class 1. The data for the two classes should be distributed randomly in the arrays. In particular, the data for class 0 should not all be in the first half of the arrays, with the data for class 1 in the second half.

We will model each cluster as a multivariate normal distribution. Recall that the probability density of such a distribution is given by

$$p(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^k\,|\Sigma|}} \exp\!\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})\right)$$

where µ is the mean (cluster centre), Σ is the covariance matrix, and k is the dimensionality of the data (2 in our case). To generate data for a cluster, use the function multivariate_normal in numpy.random. Use the function shuffle in sklearn.utils to distribute the data randomly in the arrays.
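One way to implement genData along these lines (a sketch, not the official solution — the name and signature follow the assignment, the internals are one reasonable choice):

```python
import numpy as np
from sklearn.utils import shuffle

def genData(mu0, mu1, Sigma0, Sigma1, N):
    """Generate 2N labelled points: N from each of two Gaussian clusters."""
    X0 = np.random.multivariate_normal(mu0, Sigma0, N)   # class 0 cluster
    X1 = np.random.multivariate_normal(mu1, Sigma1, N)   # class 1 cluster
    X = np.vstack((X0, X1))                              # (2N, 2) data matrix
    t = np.concatenate((np.zeros(N), np.ones(N)))        # (2N,) target vector
    X, t = shuffle(X, t)                                 # mix the classes randomly
    return X, t
```

The call to shuffle permutes X and t with the same permutation, so labels stay attached to their points while the two classes are interleaved.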

(b) (1 point) Use your function from part (a) to generate two clusters with 10,000 points each, with mu0 = (0, −1), mu1 = (−1, 1), and

$$\Sigma_0 = \begin{pmatrix} 2.0 & 0.5 \\ 0.5 & 1.0 \end{pmatrix}, \qquad \Sigma_1 = \begin{pmatrix} 1.0 & -1.0 \\ -1.0 & 2.0 \end{pmatrix}$$
You will have to encode these argument values as Numpy arrays.
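The argument values could be encoded as NumPy arrays along these lines (the variable names are one obvious choice):

```python
import numpy as np

mu0 = np.array([0.0, -1.0])
mu1 = np.array([-1.0, 1.0])
Sigma0 = np.array([[2.0, 0.5],
                   [0.5, 1.0]])    # variances 2.0 and 1.0, covariance 0.5
Sigma1 = np.array([[1.0, -1.0],
                   [-1.0, 2.0]])   # negatively correlated dimensions
```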

(c) Display the data from part (b) as a scatter plot, using red dots for points in cluster 0, and blue dots for points in cluster 1. Use the function scatter in matplotlib.pyplot. Specify a relatively small dot size by using the named argument s=2. Use the functions xlim and ylim to extend the x and y axes from -5 to 6. Title the plot, "Question 1(c): sample cluster data (10,000 points per cluster)". If you have done everything correctly, the scatter plot should look something like Figure 1, which shows two heavily overlapping clusters.

2. (?? points) Binary Logistic Regression.
In this question you will use logistic regression to generate a classifier for cluster data. You will also generate a precision-recall curve for the classifier. Use the Python class LogisticRegression in sklearn.linear_model to do the logistic regression. This class generates a Python object, much as the function Ridge did in Question 5 of Assignment 1. The class comes with a number of attributes and methods that you will find useful for answering the questions below.

(a) Use genData to generate training data consisting of two clusters with 1000 points each. Use the same cluster centers and covariance matrices as in Question 1(b).
(b) Carry out logistic regression on the data in part (a). Print out the values of the bias term, w0, the weight vector, w, and the mean accuracy of the classifier on the training data. (Mean accuracy is the fraction of predictions that are correct.)
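A minimal sketch of this step, generating stand-in training data inline; `intercept_`, `coef_`, and `score` are the standard sklearn attributes for the bias, weights, and mean accuracy:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Training data: two clusters of 1000 points each (as in Question 1(b))
rng = np.random.default_rng(0)
N = 1000
X0 = rng.multivariate_normal([0, -1], [[2.0, 0.5], [0.5, 1.0]], N)
X1 = rng.multivariate_normal([-1, 1], [[1.0, -1.0], [-1.0, 2.0]], N)
X = np.vstack((X0, X1))
t = np.concatenate((np.zeros(N), np.ones(N)))

clf = LogisticRegression().fit(X, t)
w0 = clf.intercept_[0]       # bias term
w = clf.coef_[0]             # weight vector
accuracy = clf.score(X, t)   # mean accuracy on the training data
print(w0, w, accuracy)
```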

(c) Generate a scatter plot of the training data as in Question 1(c), and draw the decision boundary of the classifier as a black line on top of the data. Title the figure, "Question 2(c): training data and decision boundary".

(d) Recall that the standard decision boundary tends to make the number of false positives equal to the number of false negatives. However, these two kinds of error may have different costs, and we may want to shift the decision boundary to account for this. That is, instead of defining the decision boundary by $w^T x + w_0 = 0$, we may want to define it by $w^T x + w_0 = t$ for some threshold, t. Generate a scatter plot of the data, and plot seven different decision boundaries on top of it, for t = 3, 2, 1, 0, −1, −2, −3. Plot the decision boundary as a blue line when t is positive, as a red line when t is negative, and as a black line when t is 0. Title the figure, "Question 2(d): decision boundaries for seven thresholds".
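Each boundary $w^T x + w_0 = t$ can be drawn by solving for the second coordinate. A sketch (the fitted parameters here are hypothetical stand-ins for `clf.intercept_[0]` and `clf.coef_[0]`):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")              # non-interactive backend for scripted runs
import matplotlib.pyplot as plt

w0 = -0.5                          # hypothetical bias
w = np.array([1.2, -2.0])          # hypothetical weight vector

xs = np.linspace(-5, 6, 100)
for t in [3, 2, 1, 0, -1, -2, -3]:
    ys = (t - w0 - w[0] * xs) / w[1]   # solve w[0]*x + w[1]*y + w0 = t for y
    colour = "blue" if t > 0 else ("red" if t < 0 else "black")
    plt.plot(xs, ys, color=colour)
plt.xlim(-5, 6)
plt.ylim(-5, 6)
plt.title("Question 2(d): decision boundaries for seven thresholds")
plt.savefig("q2d.png")
```

In the full solution the scatter of the data would be drawn first, with the seven lines plotted on top.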

(e) Which of the seven values of t in part (d) gives the greatest number of false positives (i.e., false blue predictions)? Explain your answer.

(f) For t = 1, what is the probability of a point on the decision boundary being in class 1 (i.e., blue)?

(g) Use genData to generate test data consisting of two clusters with 10,000 points each. Use the same cluster centers and covariance matrices as for the training data.

(h) Use the test data to compute and print out the following values for t = 1:
• The number of predicted positives (i.e., points predicted to be in class 1)
• The number of predicted negatives (i.e., points predicted to be in class 0)
• The number of true positives (i.e., predictions for class 1 that are correct).
• The number of false positives (i.e., predictions for class 1 that are incorrect)
• The number of true negatives (i.e., predictions for class 0 that are correct)
• The number of false negatives (i.e., predictions for class 0 that are incorrect).
• The precision.
• The recall.

The number of predicted positives should be less than the number of predicted negatives. The number of true positives should be much greater than the number of false positives. (Explain both of these points, generating an appropriate figure to simplify your explanation. Title the figure, “Question 2(h): explanatory figure”.)
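All of the counts above can be computed directly from boolean masks. A sketch with synthetic stand-ins for the true labels and decision values:

```python
import numpy as np

rng = np.random.default_rng(0)
t_true = rng.integers(0, 2, size=1000)           # stand-in true labels
z = rng.normal(size=1000) + 2 * t_true           # stand-in values of w.x + w0

thresh = 1.0
pred = (z > thresh).astype(int)                  # predict class 1 when w.x + w0 > t

TP = int(np.sum((pred == 1) & (t_true == 1)))    # correct class-1 predictions
FP = int(np.sum((pred == 1) & (t_true == 0)))    # incorrect class-1 predictions
TN = int(np.sum((pred == 0) & (t_true == 0)))    # correct class-0 predictions
FN = int(np.sum((pred == 0) & (t_true == 1)))    # incorrect class-0 predictions
precision = TP / (TP + FP)
recall = TP / (TP + FN)
```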

(i) Use the test data to generate a precision/recall curve for the classifier. That is, plot precision vs recall for 1000 different values of the threshold, t. You should choose the range of t values so that the curve is as long as possible. You should find that 0.5 ≤ precision ≤ 1 and 0 ≤ recall ≤ 1. The result should look something like Figure 2 (although the minimum precision in this curve is different). Label the axes, and title the figure, “Question 2(i): precision/recall curve”.
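One way to sweep the threshold is over a uniform grid of the decision values, skipping thresholds at which no positives are predicted (precision is undefined there). A sketch with stand-in scores:

```python
import numpy as np

rng = np.random.default_rng(0)
t_true = rng.integers(0, 2, size=10000)
z = rng.normal(size=10000) + 2 * t_true          # stand-in decision values

n_pos = int((t_true == 1).sum())
precisions, recalls = [], []
for thresh in np.linspace(z.min() - 1e-9, z.max(), 1000):
    pred = z > thresh
    n_pred = int(pred.sum())
    if n_pred == 0:
        continue                                  # precision undefined here
    TP = int(np.sum(pred & (t_true == 1)))
    precisions.append(TP / n_pred)
    recalls.append(TP / n_pos)
```

Plotting `recalls` against `precisions` then gives the curve; starting the sweep just below the minimum score makes recall reach 1.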
(j) Explain why the minimum precision is 0.5.
(k) Compute and print the area under the precision/recall curve. The area should be between 0.5 and 1.0. (Recall that the area under a curve (AUC) is the area between the curve and the x axis.)
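The area can be estimated with the trapezoidal rule over the recall axis. A sketch with hypothetical precision/recall samples (sorted by increasing recall):

```python
import numpy as np

# Hypothetical (recall, precision) samples, sorted by increasing recall
recall = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
precision = np.array([1.0, 0.95, 0.85, 0.7, 0.5])

# Trapezoidal rule: sum of strip width times average height
auc = float(np.sum((recall[1:] - recall[:-1]) *
                   (precision[1:] + precision[:-1]) / 2))
print(auc)  # prints 0.8125
```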

(l) Explain why the area under the curve must be between 0.5 and 1.0. (You may want to include a figure in your explanation. If so, title it, "Question 2(l): explanatory figure".)

3. (?? points total) Multi-class Classification. In this question, you will use logistic regression and K nearest neighbors (KNN) to classify images of handwritten digits. There are ten different digits (0 to 9), so you will be using multi-class classification. To start, download and uncompress (if necessary) the MNIST data file from the course
web page. The file, called mnist.pickle.zip, contains training and test data. Next, start the Python interpreter and import the pickle module. You can then read the file
mnist.pickle with the following command ('rb' opens the file for reading in binary):

```python
with open('mnist.pickle', 'rb') as f:
    Xtrain, Ytrain, Xtest, Ytest = pickle.load(f)
```

The variables Xtrain and Ytrain contain training data, while Xtest and Ytest contain test data. Use this data for training and testing in this question and in the rest of this assignment. Xtrain is a Numpy array with 60,000 rows and 784 columns. Each row represents a hand-written digit. Although each digit is stored as a row vector with 784 components, it actually represents an array of pixels with 28 rows and 28 columns (784 = 28 × 28). Each pixel is stored as a floating-point number, but has an integer value between 0 and 255 (i.e., the values representable in a single byte). The variable Ytrain is a vector of 60,000 image labels, where a label is an integer between 0 and 9. For example, if row n of Xtrain is an image of the digit 7, then Ytrain[n] = 7. Likewise for Xtest and Ytest, which represent 10,000 test images.

To view a digit, you must first convert it to a 28 × 28 array using the function numpy.reshape. To display a 2-dimensional array as an image, you can use the function imshow in matplotlib.pyplot. To see an image in black-and-white, add the keyword argument cmap='Greys' to imshow. To remove the smoothing and see the 784 pixels clearly, add the keyword argument interpolation='nearest'. Try displaying a few digits as images. (Figure 3 shows an example.) For comparison, try printing them as vectors. (Do not hand this in.)
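The reshape-and-display step might look like this sketch. Since mnist.pickle is not bundled here, a row of random pixel values stands in for a row of Xtrain:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")                      # non-interactive backend
import matplotlib.pyplot as plt

# Stand-in for one row of Xtrain: 784 pixel values in [0, 255]
row = np.random.randint(0, 256, 784).astype(float)

digit = np.reshape(row, (28, 28))          # row vector -> 28 x 28 pixel array
plt.imshow(digit, cmap='Greys', interpolation='nearest')
plt.savefig("digit.png")
```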

### 1. Introduction

This project implements data generation, binary logistic regression, multi-class classification, softmax regression, and gradient descent in Python, using the mnist.pickle file for the image-classification tasks. Each task is implemented in a separate Python file. The first task implements matrix multiplication and displays a scatter plot. The second task performs binary logistic regression on generated cluster data and displays it as a scatter plot. The third task performs multi-class classification, using the MNIST file to generate the precision/recall curve. The fourth task covers logistic regression and computes the cross-entropy from its formula. The fifth task implements softmax and the cross-entropy calculation related to task 4. The sixth task implements batch gradient descent for multi-class classification. The seventh task implements stochastic gradient descent and reports how the output depends on the batch size.

### 2. Matrix Multiplication

1. The first task implements matrix multiplication and verifies the computed values. Note that NumPy indices begin at 0, not 1: for example, one element is addressed at row 7, column 0, and another at row 0, column 4. We use NumPy arrays for element-wise multiplication, and for two 2-dimensional arrays A and B we also perform true matrix multiplication in Python. In addition, we combine an array with a vector: because NumPy broadcasts the operation, adding a vector v to a matrix A (as A + v) adds v to every row of A in parallel, whether v is treated as a row against A's rows or as a column against A's columns (Allison and Allison, 2012). This avoids explicit iteration.
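The distinction between element-wise products, matrix multiplication, and broadcasting described above can be seen in a few lines:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[5.0, 6.0],
              [7.0, 8.0]])
v = np.array([10.0, 20.0])

elementwise = A * B     # multiplies matching entries
matmul = A @ B          # true matrix multiplication
shifted = A + v         # v is broadcast to every row of A, no explicit loop
```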
1. Finally, the task displays its output as a scatter plot. As an additional step, for the binary classification problem we first generate and plot the two-dimensional data. The generating function returns the arrays X and t and produces the two clusters defined by Sigma0 and Sigma1.
1. The data are displayed as a scatter plot using red and blue dots, one colour per cluster: red dots for points in cluster 0 and blue dots for points in cluster 1.

### 3. Binary Logistic Regression

This task generates a classifier for the two-cluster data, along with its precision/recall curve. We first generate the data, using the same cluster centres and covariance matrices as before, then vary the threshold value t and plot the resulting decision boundaries over the points. The predictions are summarized as predicted positives, predicted negatives, true positives, true negatives, false positives and false negatives (Feng et al., 2016). The minimum precision is 0.5 because the test set contains equally many points from each class: as the threshold is lowered, eventually every point is predicted positive, and precision falls to the fraction of the data that is actually positive, which is 10,000/20,000 = 0.5.

### 4. Multi - Class Classification

We use logistic regression and K nearest neighbours (KNN) to classify the imported MNIST images. The file contains the Xtrain, Ytrain, Xtest and Ytest data. KNN classifies each point according to the labels of its k nearest training examples. The training data have 60,000 rows and 784 columns; each image is converted to a 28 × 28 array with pixel values between 0 and 255. The MNIST file contains many images (Gusev and Ristov, 2013), and we display several of them together in a single figure arranged in a 6 × 6 grid using Python code. We also print the test accuracy obtained from the training dataset, evaluating values of k up to 20 and reporting the test accuracy for each k.

### 5. Logistic Regression

In this task we use the training data. We take a real vector x_n and a binary target t_n and compute the cross-entropy from them. We supply test values to check whether the implementation is correct, using a prediction value and an epsilon value to guard the calculation. In machine learning and optimization, the cross-entropy is the loss function used to score the predicted values (Huang, 2012); from the predictions we also derive the counts of positives, negatives, true positives, true negatives, false positives and false negatives.
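The cross-entropy with an epsilon guard might be implemented as follows (a sketch; the clipping constant is an assumption):

```python
import numpy as np

def cross_entropy(y, t, eps=1e-12):
    """Mean binary cross-entropy between predictions y in (0, 1) and targets t."""
    y = np.clip(y, eps, 1 - eps)   # epsilon guard: avoid log(0)
    return -np.mean(t * np.log(y) + (1 - t) * np.log(1 - y))
```

Clipping keeps the loss finite even when a prediction is exactly 0 or 1.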

Proof

a) The result follows from the second-last equation.

b) Assume values for X, T and Y, substitute them into $X^T(Y - T)$, and evaluate.

c) Take i = 0; the claim then follows.

### 6. Softmax

This task implements multi-class logistic regression, starting with the softmax function, which is a generalization of logistic regression to more than two classes (Kleinbaum and Klein, 2011). Let z be a k-dimensional vector and let y equal the softmax of z; we implement the softmax using its k-dimensional formula. We consider two implementations, softmax(z) and softmax1(z); for z = (0, 0) the softmax function returns the values (0.5, 0.5).
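A numerically stable softmax might be sketched as follows; subtracting max(z) before exponentiating prevents overflow without changing the result:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: shifting z by a constant leaves it unchanged."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - np.max(z))      # subtract max(z) so exp never overflows
    return e / e.sum()
```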

The first case returns the correct value; the second case raises a warning and returns (nan, 0); the third case returns (−inf, 0). We compute the first element's value, which comes to 0.5. In the second part of the task we transform the vector z into z′.

b (i) Consider $z'_i = z_i - m$, where $m$ is a constant (for numerical stability, take $m = \max_i z_i$). Since $e^{z_i - m} = e^{z_i} e^{-m}$ and the common factor $e^{-m}$ cancels between the numerator and denominator of the softmax, it follows that softmax(z′) = softmax(z).

b (ii) Consider the log loss $L_i = -\log(y_k)$, where $y_k = p_k$ is the softmax output for the target class $k$. Computing the derivative with respect to $z_k$:

$$\frac{\partial L_i}{\partial z_k} = -\frac{1}{p_k}\, p_k (1 - p_k) = p_k - 1$$

We compute the softmax function using Python code. The function should return the two values y and logy, where y is the softmax of z and logy is the log of the softmax of z.

This task builds on tasks 4 and 5: it trains the multi-class logistic regression classifier with batch gradient descent. Batch gradient descent strikes a balance between stochastic gradient descent and the efficiency of full gradient descent. Using the training dataset, we compute the gradients and update the weights, then test on the MNIST data and display the results as a scatter plot. We also define the cross-entropy per data point (ZHENG and LUO, 2013), where N is the number of data points in the sum. The weights are initialized from a Gaussian distribution with mean 0 and standard deviation 0.01, and the update is repeated for 5,000 iterations. The program computes the training loss, mean training accuracy, test loss and test accuracy; the training loss measures how poorly the fitted model matches the training data. Additionally, we plot the training loss over the first 200 iterations, which shows how the training accuracy and training loss evolve, and we likewise track the test loss and test accuracy.
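The training loop described above might be sketched as follows. This is a toy stand-in, not the assignment's actual solution: random data replaces MNIST, the bias term is omitted, and only 100 of the 5,000 updates are run; the Gaussian weight initialization (mean 0, std 0.01) follows the text.

```python
import numpy as np

def softmax_rows(Z):
    Z = Z - Z.max(axis=1, keepdims=True)   # stabilize each row
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
N, D, K = 200, 784, 10                     # toy sizes (MNIST-like width)
X = rng.normal(size=(N, D))
T = np.eye(K)[rng.integers(0, K, N)]       # one-hot targets

W = rng.normal(0.0, 0.01, size=(D, K))     # Gaussian init: mean 0, std 0.01
lr = 0.1                                   # hypothetical learning rate
for step in range(100):                    # the assignment runs 5000 updates
    Y = softmax_rows(X @ W)                # forward pass on the full batch
    grad = X.T @ (Y - T) / N               # gradient of mean cross-entropy
    W -= lr * grad                         # batch gradient-descent update
train_loss = -np.mean(np.sum(T * np.log(softmax_rows(X @ W) + 1e-12), axis=1))
```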

We use 200 training points to evaluate the loss function.

We use 2,000 training points for the loss function.

In this task we vary the batch size, working with 500 training points and computing the training and test accuracy. Stochastic gradient descent is an iterative method, also called incremental gradient descent. We take the training dataset and iterate over it; although the underlying optimization problem can be analysed mathematically, here we solve it with Python code, which counts the iterations and performs the updates. We consider a mini-batch of the training dataset and compute the gradient on that small subset, update the weights, and then move on to the next mini-batch. With a mini-batch size of 100, each step uses 100 points, so the 500 training points are covered in five mini-batches. The procedure is step by step: first compute the gradient and update the weights, then compute the training accuracy and training loss and the test accuracy and test loss. We plot every dataset in a single graph; as the test accuracy increases, the test loss decreases. Having already plotted the training and test losses, blue dots denote the test loss and red dots the training loss, and finally all values are printed for each graph. The training accuracy is greater than the test accuracy, and the training loss is less than the test loss. The graph depends on the batch size: as the batch size increases, the curves change accordingly.
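The mini-batch loop might look like this sketch (random stand-in data, hypothetical learning rate, bias omitted), with one weight update per 100-point mini-batch:

```python
import numpy as np

def softmax_rows(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
N, D, K = 500, 100, 10                     # 500 training points (toy stand-ins)
X = rng.normal(size=(N, D))
T = np.eye(K)[rng.integers(0, K, N)]       # one-hot targets
W = rng.normal(0.0, 0.01, size=(D, K))
lr, batch_size = 0.1, 100                  # 100 points per mini-batch

for epoch in range(20):
    order = rng.permutation(N)             # reshuffle so batches differ per epoch
    for start in range(0, N, batch_size):
        idx = order[start:start + batch_size]
        Yb = softmax_rows(X[idx] @ W)
        W -= lr * X[idx].T @ (Yb - T[idx]) / len(idx)   # update per mini-batch

train_loss = -np.mean(np.sum(T * np.log(softmax_rows(X @ W) + 1e-12), axis=1))
```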

## Conclusion

This project solves matrix multiplication, binary logistic regression and multi-class logistic regression in Python. Using the training data, we compute the training loss and accuracy and the test loss and accuracy; each problem is solved with Python code. The first task computes matrix multiplication and applies the sigma values. The second task sets up the regression problem: we generate the training dataset and compute the covariance matrices, with results depending on the threshold values t. The third task combines all the images and displays them in a single figure, using the MNIST dataset. The fourth task computes the cross-entropy mathematically. The fifth task implements softmax, which builds on multi-class logistic regression; we use the two softmax values and generate them with Python code. The sixth task implements batch gradient descent, based on multi-class logistic regression and gradient descent, again on the MNIST dataset and implemented in Python; we find that the training accuracy is greater than the test accuracy and the training loss is less than the test loss, and we plot graphs for the MNIST dataset. The final task is stochastic gradient descent, which depends on the batch size; we plot graphs for each batch size and compute the training and test accuracy and loss.

## References

Allison, P. and Allison, P. (2012). Logistic regression using SAS. Cary, NC: SAS Institute.

Feng, W., Sarkar, A., Lim, C. and Maiti, T. (2016). Variable selection for binary spatial regression: Penalized quasi-likelihood approach. Biometrics, 72(4), pp.1164-1172.

Gusev, M. and Ristov, S. (2013). A superlinear speedup region for matrix multiplication. Concurrency and Computation: Practice and Experience, 26(11), pp.1847-1868.

Huang, T. (2012). Neural information processing. Heidelberg: Springer.

Kleinbaum, D. and Klein, M. (2011). Logistic regression. New York: Springer.

Zheng, X. and Luo, Y. (2013). Improved clonal selection algorithm for multi-class data classification. Journal of Computer Applications, 32(11), pp.3201-3205.

### Cite This Work

My Assignment Help. (2019). Machine Learning And Data Mining. Retrieved from https://myassignmenthelp.com/free-samples/csc-411-machine-learning-and-data-mining.

"Machine Learning And Data Mining." My Assignment Help, 2019, https://myassignmenthelp.com/free-samples/csc-411-machine-learning-and-data-mining.

My Assignment Help (2019) Machine Learning And Data Mining [Online]. Available from: https://myassignmenthelp.com/free-samples/csc-411-machine-learning-and-data-mining
[Accessed 10 August 2020].

My Assignment Help. 'Machine Learning And Data Mining' (My Assignment Help, 2019) <https://myassignmenthelp.com/free-samples/csc-411-machine-learning-and-data-mining> accessed 10 August 2020.

My Assignment Help. Machine Learning And Data Mining [Internet]. My Assignment Help. 2019 [cited 10 August 2020]. Available from: https://myassignmenthelp.com/free-samples/csc-411-machine-learning-and-data-mining.
