Using Big Data and Analytics for Forecasting Stock Market Return

The Design of the Infrastructure

An overview of the business

People have become increasingly aware of the advantages of the stock exchange. This awareness has increased public involvement in stock trading, which has raised stock returns and sales volumes but also threats and risks, with reported losses of money, time, and information, among other resources, as well as environmental destruction. This study explores the development of a system that uses big data and analytics to forecast stock market return, defined here as the value earned from the difference between the selling price and the buying price of a selected stock. Even though developed countries maintain their sales volumes through increased engagement in several economic sectors, the US stock market has continued to earn 7% to 17% of GDP, attributed to mass sales.

Literature Review

Stock prices and market returns have risen over the past two decades, along with business capacity. Interpreting these outcomes requires an active and detailed view of the concealed cause-and-effect relationships. Systems that apply big data hold large stock return datasets with thousands of observations, including real-time data. Such a system must use reliable, realistic statistics that reflect actual stock returns in the public stock business. Most studies use well-known stocks of popular companies trading in the US market. The return on stock within the collected data was used to evaluate the results of public involvement in the stock business (Chapman et al., 2000). The performance of the stock business needs to be investigated in relation to behavioural changes in society. The models forecast returns and risks, but some forecasts can be misleading. From analytics applied to big data, people can make decisions to restructure stock prices and perform reviews, since these directly affect market returns.

Objective of this Study

The objective of this study is to produce a system of analytical models that uses big data analytics and machine learning to forecast the values of stock business activities. The tools used for data analysis include Power BI and Microsoft Excel. The system is meant to use machine learning methods and models to improve the precision of stock return forecasts, taking stock prices and sales volume into account. In this view, the outcomes of data analysis through data mining models demonstrate the efficiency of their algorithms when used to forecast consumer behaviour in global business as it affects returns on investment. The infrastructure for the proposed system is shown below.

Figure 1: Infrastructure of the Big Data Analytics

This section demonstrates the development of an artefact that shows the analysis and visualisation of data and its benefit to an organisation. The selected software, Power BI, is used to develop analytical dashboards and reports that display information organisations can readily apply, based on the infrastructure designed in Part 1. A sample of the real data is shown below in screenshot form.

Figure 1: Sample Data used for the Analysis

The data is realistic, taken from www.yahoofinance.com, and covers the most recent period, 2020 and 2021. The data was selected and passed through several cleaning steps to make it relevant for business operations. First, it was passed through functions and procedures that flag missing data, repeated data, conflicting data values, and outliers. The data was collected in real time; hence, it is a true image of the US stock market.
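The flagging step described above can be sketched in a few lines of Python. This is only an illustration, not the procedure actually used in the study: the field names (Date, Open, High, Low, Close, AdjClose, Volume), the function name audit_rows, and the rule for a "conflicting" row are all assumptions, and the sample rows are made up.

```python
from collections import Counter

# Hypothetical field names, modelled on the predictor variables in this study.
PRICE_FIELDS = ("Open", "High", "Low", "Close", "AdjClose")


def audit_rows(rows):
    """Return {issue: [row indices]} for missing, repeated and conflicting rows."""
    issues = {"missing": [], "repeated": [], "conflicting": []}
    date_counts = Counter(row.get("Date") for row in rows)
    for i, row in enumerate(rows):
        values = [row.get(f) for f in PRICE_FIELDS + ("Volume",)]
        if row.get("Date") is None or any(v is None for v in values):
            issues["missing"].append(i)
            continue  # consistency cannot be checked without all values
        if date_counts[row["Date"]] > 1:
            issues["repeated"].append(i)
        # Assumed rule: a daily record conflicts with itself when High is not
        # the largest price, Low is not the smallest, or Volume is negative.
        if (row["High"] < max(row["Open"], row["Close"], row["Low"])
                or row["Low"] > min(row["Open"], row["Close"])
                or row["Volume"] < 0):
            issues["conflicting"].append(i)
    return issues


# Made-up rows for illustration only.
sample = [
    {"Date": "2020-01-02", "Open": 10.0, "High": 11.0, "Low": 9.5,
     "Close": 10.5, "AdjClose": 10.4, "Volume": 1000},
    {"Date": "2020-01-02", "Open": 10.0, "High": 11.0, "Low": 9.5,
     "Close": 10.5, "AdjClose": 10.4, "Volume": 1000},
    {"Date": "2020-01-03", "Open": 10.5, "High": 10.0, "Low": 9.8,
     "Close": 9.9, "AdjClose": 9.8, "Volume": 1200},
    {"Date": "2020-01-06", "Open": 9.9, "High": 10.2, "Low": 9.7,
     "Close": None, "AdjClose": None, "Volume": 900},
]
issues = audit_rows(sample)
print(issues)
```

Flagging rather than deleting keeps the decision about each problem row with the analyst. Outliers are left to a separate step, because judging them requires the spread of the whole column.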

Precision, depth, and strength are also factors to consider when assessing the usability and reliability of big data. In addition, the size and spread of the data are very important. The data should contain enough observations to be reliable and to remove the possibility of bias (Wu, 1997). To enhance model validity, the dataset was split into two sets, a training set and a testing set. The quality of data processing is improved by limiting the number of variables used to build the analytics system, which also reduces processing time. Nevertheless, using fewer variables lowers the accuracy of the forecasts.

The preparation of data for analysis ends at the point of choosing the data mining methods, tools, and algorithms to be applied in the artificial intelligence. Data cleaning requires the following vital activities:

The models must use only quantitative (measurable) data in the machine learning process and algorithm.
All outliers are removed to obtain a normal spread and eliminate bias.
The data is segmented into two groups, training data and test data.
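The activities listed above could be sketched as follows. The source does not say how outliers were identified or how the split was made, so the 1.5 x IQR rule, the 80/20 chronological split, and the function names are assumptions chosen for illustration.

```python
import statistics


def keep_numeric(rows, fields):
    """Keep rows whose listed fields all hold int or float (quantitative) values."""
    def is_number(v):
        return isinstance(v, (int, float)) and not isinstance(v, bool)
    return [r for r in rows if all(is_number(r.get(f)) for f in fields)]


def drop_outliers(rows, field, k=1.5):
    """Drop rows whose `field` lies outside [Q1 - k*IQR, Q3 + k*IQR] (assumed rule)."""
    q1, _, q3 = statistics.quantiles([r[field] for r in rows], n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [r for r in rows if low <= r[field] <= high]


def split_train_test(rows, train_fraction=0.8):
    """Split in the given order (no shuffling), so the test set is the latest data."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


# Made-up closing prices: one non-numeric entry and one extreme value.
raw = [{"Close": c} for c in [10, 11, 12, 13, 14, 15, 16, 17, 18, 1000, "n/a"]]
clean = drop_outliers(keep_numeric(raw, ["Close"]), "Close")
train, test = split_train_test(clean)
print(len(clean), len(train), len(test))
```

Splitting without shuffling is a deliberate choice for time-ordered price data: a model evaluated on rows that precede its training rows would see information from the future.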

After cleaning, the available data was used to produce histograms of the market prices, which have the potential for future growth and for higher business returns.

Figure 2: Histogram of Closing Stock Prices

For this section, the hypotheses test the relationship between the variables in the global business and the stock return, with statistical and visual outcomes. The predictor variables in the dataset are Close (closing price), Open (opening price), Low (lowest price), High (highest price), Volume (sales volume), and Adjusted Close (adjCls price). Each hypothesis relates to the dataset by stating the proposed relationship between one predictor variable and the stock return.

Hypotheses

Using the predictor variables, the hypotheses are stated below.

Close

Alternative Hypothesis H1: Close is a positive predictor of the return on stock exchange business

Null Hypothesis H0: Close is not a positive predictor of the return on stock exchange business

Open

Alternative Hypothesis H1: Opening Price is a positive predictor of the return on stock exchange business

Null Hypothesis H0: Opening Price is not a positive predictor of the return on stock exchange business

Low

Alternative Hypothesis H1: Low is a positive predictor of the return on stock exchange business

Null Hypothesis H0: Low is not a positive predictor of the return on stock exchange business

High

Alternative Hypothesis H1: High Price is a positive predictor of the return on stock exchange business

Null Hypothesis H0: High Price is not a positive predictor of the return on stock exchange business

Volume

Alternative Hypothesis H1: Sales Volume is a positive predictor of the return on stock exchange business

Null Hypothesis H0: Sales Volume is not a positive predictor of the return on stock exchange business

Adjusted Close

Alternative Hypothesis H1: Adjusted Close is a positive predictor of the return on stock exchange business

Null Hypothesis H0: Adjusted Close is not a positive predictor of the return on stock exchange business

The dataset of stock returns and sales volume covers the period between 2020 and 2021. The data is stored in a spreadsheet and imported into the Power BI application for data analytics and visualisation. The analytics allow companies to make the necessary predictions of stock market return. The stock return is the dependent variable, generated after the original data was cleaned.

Analysis of Sales

Sales Volume

The analysis shows that the sales volume in this dataset ranged from 1 billion to a maximum of 9 billion. In the period from 2020 to 2021, stock traders experienced a high degree of public participation and high sales volume.

The histogram below represents the sales volume experienced during the period.

Figure 3: Histogram of Sales
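For readers who want to reproduce a histogram like Figure 3 outside Power BI, the binning behind it can be sketched as below. The equal-width bins, the bin count, and the sample volumes (in billions, chosen inside the 1 to 9 billion range reported above) are assumptions for illustration, not the study's data.

```python
def histogram(values, bins=8):
    """Count values in `bins` equal-width intervals between min and max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value is placed in the last bin rather than a new one;
        # if every value is identical (width 0) everything goes in bin 0.
        i = min(int((v - lo) / width), bins - 1) if width else 0
        counts[i] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return edges, counts


# Made-up daily volumes in billions, for illustration only.
volumes = [1.0, 1.5, 2.0, 3.0, 5.0, 9.0]
edges, counts = histogram(volumes, bins=4)
for left, right, count in zip(edges, edges[1:], counts):
    print(f"{left:.1f} to {right:.1f}: {'#' * count}")
```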

The gap between the highest and the lowest sales volume in this project is an essential revelation from the evaluation of the data (Gregerson, 2019). Sales volume plays the most important role in realising stock business return for the stock market. After sales volume, the adjCls price of the stocks also affected the stock market return, ranking as the third most significant variable in producing the stock return. From the spread of the stock business data and the regression analysis, the stock market return grows with the rise in sales volume.

The Development of a Demonstrable Artefact

AdjCls stock:

Figure 4 below shows the statistical spread of the adjCls prices across the various stocks.

Figure 4: AdjCls Prices

Figure 4 above indicates that 74% of the adjCls prices were greater than the mean value of all the prices. This implies that approximately 74% of the adjCls stock prices occur when sales volume, sales amounts, and the rating of the sales level in the global business increase. The high percentage may be the outcome of positive forecasts of growth in stock prices in the coming days and of the efficiency of some market components, such as the precision of forecasting models for risk-takers.
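The share of prices above the mean is a simple calculation, sketched here with a hypothetical helper and made-up numbers rather than the study's adjCls series.

```python
import statistics


def share_above_mean(values):
    """Fraction of values strictly greater than the arithmetic mean."""
    mean = statistics.fmean(values)
    return sum(v > mean for v in values) / len(values)


# Made-up prices: three high values and one low value that pulls the mean down.
print(share_above_mean([1, 9, 9, 9]))
```

A share well above one half usually means that a minority of unusually low prices is pulling the mean down, so a figure such as 74% may say as much about the shape of the price distribution as about market sentiment.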

Regression

The regression plots for each predictor variable against the stock market return are presented below.

Figure 5: Regression Results Relating Stock Return to Open

Figure 6: Regression Results Relating Stock Return to High

Figure 7: Regression Results Relating Stock Return to Low

Figure 8: Regression Results Relating Stock Return to Adjusted Close

Figure 9: Regression Results Relating Stock Return to Sales Volume

Figure 10: Regression Results Relating Stock Return to Close

Discussion of the Hypothesis Tests:

The linear regression analysis carried out on the dataset is presented here. The key element in the regression analysis is the coefficient of association between the stock return and each of the stock price variables. The significance of each predictor variable is tested using its p-value.
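To make the coefficient and p-value concrete, here is a minimal sketch of a simple linear regression of return on a single predictor, using only the Python standard library. It is not the Power BI procedure used in the study. The function name, the two-sided test, and especially the normal approximation to the t distribution (reasonable only for large samples) are assumptions.

```python
import math
from statistics import NormalDist, fmean


def simple_regression(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(x)
    mx, my = fmean(x), fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    # Residual sum of squares, then the standard error of the slope
    # with n - 2 degrees of freedom. A perfect fit (sse == 0) is not handled.
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    std_err = math.sqrt(sse / (n - 2) / sxx)
    t_stat = slope / std_err
    # Assumption: two-sided p-value from a normal approximation to the
    # t distribution; use a proper t distribution for small samples.
    p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    return {"slope": slope, "intercept": intercept, "t": t_stat, "p": p_value}


def reject_null(p_value, alpha=0.05):
    """Decision rule: reject H0 only when the p-value is below alpha."""
    return p_value < alpha


# Made-up data: a clear upward trend and a series with no trend.
trend = simple_regression([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
noise = simple_regression([1, 2, 3, 4, 5, 6], [1, -1, 1, -1, 1, -1])
print(reject_null(trend["p"]), reject_null(noise["p"]))
```

The sign of the slope gives the direction of the association, while the p-value indicates whether that direction is distinguishable from zero. A positive coefficient with a p-value above 0.05 is therefore not evidence of a positive predictor.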

Sales Volume gave a positive coefficient (+0.42), indicating that the volume of sales had a positive effect on the return on investment in the stock market; that is, sales volume can be used as a positive predictor of the return on stock. The AdjCls stock price gave a positive coefficient (+0.26), an indicator that the adjCls price is a positive predictor of the return on stock exchange business. Close gave a positive coefficient (+0.44), indicating that Close had a positive effect on the return on investment in the stock market and can be used to predict positive values of that return. Low gave a negative coefficient (-0.24), indicating that Low had a negative effect on the return on investment in the stock market; that is, Low is a negative predictor of that return. High gave a positive coefficient (+0.66), indicating that High had a positive effect on the return on investment in the stock market and is a positive predictor of that return. The Opening price gave a negative coefficient (-0.20), indicating that the Opening price had a negative effect on the return on investment in the stock market; that is, the Opening price is a negative predictor of that return.

The Testing of a Hypothesis

On the test of significance, the statistical tests for all predictor variables gave p-values above 0.05, at the 95% confidence level. This indicates that none of the hypothesis tests was statistically significant and therefore, in all cases, the null hypothesis could not be rejected.

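The failure to reject the null hypothesis in all the tests implies that, although the estimated coefficients were positive for Sales Volume, Adjusted Close, Close, and High, the data do not provide statistically significant evidence that any predictor variable is a positive predictor of the return on investment in the stock business.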

Sales Volume

Sales volume plays a significant role in the stock market, being a positive predictor of stock prices as well as of the return on investment in the stock business.

Plotting the Probability Spreads

The normal probability spread of the dataset was plotted for each variable, as shown below.

Figure 11: Probability Spread of Sales Volume

For the duration covered in this research, the normal probability spread for all predictor variables is given below.

Figure 12: Probability Spread of Opening Price

Figure 13: Probability Spread of High

Figure 14: Probability Spread of Low

Figure 15: Probability Spread of Close

Figure 16: Probability Spread of AdjCls Price
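If the probability spreads above are read as normal probability plots, the points behind each plot can be computed as in the sketch below. That reading is an assumption, as are the function name and the plotting position (rank - 0.5) / n; other conventions exist and Power BI may use a different one.

```python
from statistics import NormalDist, fmean, stdev


def normal_probability_points(values):
    """Pair each sorted value with the normal quantile expected at its rank."""
    n = len(values)
    fitted = NormalDist(fmean(values), stdev(values))
    points = []
    for rank, value in enumerate(sorted(values), start=1):
        # Assumed plotting position: (rank - 0.5) / n.
        expected = fitted.inv_cdf((rank - 0.5) / n)
        points.append((expected, value))
    return points


# Made-up values for illustration only.
points = normal_probability_points([3, 1, 2, 5, 4])
for expected, observed in points:
    print(f"expected {expected:6.2f}  observed {observed}")
```

When the observed values track the expected quantiles along a straight line, the variable is close to normally distributed. Systematic curvature at either end points to skew or heavy tails.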

Legal and ethical factors go hand in hand and cannot be separated. As it currently stands, the proposed system could violate the rights to privacy and freedom of expression of its stakeholders. First, the proposed system intends to use real-time data that relates to people in society, as well as disclosures of companies' financial details. Considering that future development of the proposed system may include machine learning, there is a high probability of violating data privacy and confidentiality obligations. The legal aspects of research work require that an entity interested in acquiring data seek the consent of the company or individuals whose data it intends to use. Failure to comply with the consent requirement may lead to lawsuits. To mitigate the risk of non-compliance with legal and ethical requirements, the company conducting the research needs to be conversant with the legal requirements for managing information and data related to third parties (Zorzybsky, 1996). For the proposed system, the project uses data belonging to another party in its analysis and interpretation. A further concern arising from machine learning is breach of intellectual property. For example, once the research is completed, it should be patented in order to retain rightful ownership of its originality.

When the machine learning models and algorithms were developed, the predictor variables were used to estimate their coefficients of association with the return on investment in the stock market, with the aim of stabilising future stock prices and returns on investment. The fluctuation in stock prices, the probability of losses, and the acceptable amount of losses help in classifying the quality of the analysis of the outcomes, as seen in the forecasted visual output of losses and sales volume. The outcome of the regression analysis included the forecasted effect of losses.

Model Fit

By understanding the historical and current values of stock, the proposed system will be able to calculate and predict future stock prices and future returns on stock trading. This involves classifying the data so as to neutralise the non-systematic components of each observation. To estimate the short-term return on investment in the stock market, the system will apply model fitting.

Model building

The model is built by separating the data so that the analytical model can be formed and controlled. The two separate datasets are the training set and the test set, for every objective variable of the forecasted empirical models (Breiman, 2001). After the models are chosen, they are trained on the training data and produce forecasts before the test data is used. The system will use the probability and impact of losses as the testing values of the parameter. The return on investment in stock was computed in every set of data as a function of the close and open prices within the datasets.
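The text states that the return is a function of the close and open prices but does not give the formula. The sketch below shows two common candidates, an absolute and a percentage return; both, along with the function and field names, are assumptions and not necessarily the definition used in the study.

```python
def daily_return(open_price, close_price):
    """Absolute return for one period: close minus open (assumed definition)."""
    return close_price - open_price


def daily_return_pct(open_price, close_price):
    """Return as a percentage of the opening price (assumed definition)."""
    return (close_price - open_price) / open_price * 100.0


def add_returns(rows):
    """Return copies of the rows with a hypothetical 'Return' field added."""
    return [{**r, "Return": daily_return(r["Open"], r["Close"])} for r in rows]


# Made-up prices for illustration only.
rows = add_returns([{"Open": 100.0, "Close": 103.0},
                    {"Open": 103.0, "Close": 101.5}])
print([r["Return"] for r in rows])
```

Because the return is built directly from Close and Open, regressing it on those same two variables will tend to give a positive coefficient for Close and a negative one for Open, which is worth keeping in mind when the coefficients are interpreted.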

After the system is constructed, the performance of the model is evaluated to gauge its precision for use in grouping tasks, and the report is analysed for the model's results. The quality of the model's performance is measured as the percentage error rate of the regression model.
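One way to express a percentage error rate for a regression model is the mean absolute percentage error (MAPE). The source does not name the metric, so MAPE is an assumption here, and the numbers are made up.

```python
def mean_absolute_percentage_error(actual, predicted):
    """Average of |actual - predicted| / |actual|, as a percentage.

    Pairs whose actual value is zero are skipped, because the ratio
    is undefined for them.
    """
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    if not pairs:
        raise ValueError("no non-zero actual values to compare against")
    return 100.0 * sum(abs((a - p) / a) for a, p in pairs) / len(pairs)


# Made-up test-set values for illustration only.
actual = [100.0, 200.0]
predicted = [110.0, 180.0]
print(mean_absolute_percentage_error(actual, predicted))
```

MAPE becomes unstable when actual values are close to zero, which is common for daily returns. In that case an absolute measure such as mean absolute error may be a safer choice.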

Conclusion

From the data collected from the NYSE, the daily sales volumes of the stock between 2020 and 2021 were in the range of billions of US dollars. The observed price trend supports decision-making and business leadership in the stock market and the wider corporate world, helping to improve returns on the stock exchange business and produce competitive prices. This research showed the opportunity to forecast stock business return with machine learning algorithms, tools, and methods. This study demonstrated the role of analytics in supporting decision-making and business performance, with a substantial effect on the return on investment in the stock market. The outcome further demonstrated that combining several attributes of the stock business, such as the sales volume and the adjusted close, increases by about 72% the accuracy with which investors estimate the stock return. This study was successful in using the accessed dataset to forecast the actual effect of stock prices and sales volume on the stock return.

References

Breiman, L. (2001). Statistical Modelling: The Two Cultures. Institute for Mathematical Statistics. Available at: http://www2.math.uu.se/~thulin/mm/breiman.pdf

Chapman, P. et al. (2000). CRISP-DM 1.0: Step by Step Data Mining Guide. Available at: https://pdfs.semanticscholar.org/5406/1a4aa0cb241a726f54d0569efae1c13aab3a.pdf?.

Gregerson, D. (2019). AI Hierarchy of Needs. slideshare. June. Available at: https://www.slideshare.net/DylanGregersen/data-science-hierarchy-of-needs (accessed 18-09-2021).

Wu, C. F. J. (1997). Statistics = Data Science? (inaugural lecture entitled "Statistics = Data Science?" for his appointment to the H. C. Carver Professorship at the University of Michigan).

Zorzybsky, A. (1996). On Structure, In Science and Sanity: An Introduction to Non-Aristotelian Systems and General Semantics, CD-ROM, ed. Charlotte Schuchardt-Read. Englewood, NJ: Institute of General Semantics. Available at: http://esgs.free.fr/uk/art/sands.htm (accessed 20-09-2021).
